spark-issues mailing list archives

From "zhengruifeng (JIRA)" <>
Subject [jira] [Commented] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data
Date Thu, 02 Feb 2017 11:52:51 GMT


zhengruifeng commented on SPARK-18608:

[~yuhaoyan] Agreed, it would be nice to add an extra parameter:

In {{Predictor}}:
{code}
override def fit(dataset: Dataset[_]): M = {
  val handlePersistence = dataset.storageLevel == StorageLevel.NONE
  copyValues(train(dataset, handlePersistence).setParent(this))
}

protected def train(dataset: Dataset[_], handlePersistence: Boolean): M
{code}
and delete the old abstract method:
{code}
protected def train(dataset: Dataset[_]): M
{code}

In each classification and regression algorithm, override the new {{train}} API instead of
the old one.

For clustering algorithms, directly modify the {{fit}} method.
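
For clustering, a minimal sketch of what the guarded caching in {{fit}} could look like (this is not the actual Spark source; the column handling and the {{trainModel}} helper are assumed for illustration):
{code}
override def fit(dataset: Dataset[_]): KMeansModel = {
  transformSchema(dataset.schema, logging = true)

  // Use the Dataset-level storage level (SPARK-16063), not
  // dataset.rdd.getStorageLevel, which is NONE even when the Dataset
  // itself is cached.
  val handlePersistence = dataset.storageLevel == StorageLevel.NONE

  val instances = dataset.select(col($(featuresCol))).rdd.map {
    case Row(features: Vector) => features
  }
  if (handlePersistence) {
    instances.persist(StorageLevel.MEMORY_AND_DISK)
  }
  try {
    trainModel(instances) // hypothetical: the actual clustering iterations
  } finally {
    if (handlePersistence) {
      instances.unpersist()
    }
  }
}
{code}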

Since SPARK-16063 has already been resolved, is anyone working on this?

> Spark ML algorithms that check RDD cache level for internal caching double-cache data
> -------------------------------------------------------------------------------------
>                 Key: SPARK-18608
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Nick Pentreath
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, {{LinearRegression}}, and I
believe now {{KMeans}}) handle persistence internally. They check whether the input dataset
is cached, and if not they cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. This will actually
always be true, since even if the dataset itself is cached, the RDD returned by {{dataset.rdd}}
will not be cached.
> Hence if the input dataset is cached, the data will end up being cached twice, which
is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input {{Dataset}},
but now we can, so the checks should be migrated to use {{dataset.storageLevel}}.
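
Concretely, the migration in each affected algorithm would be a one-line change (a sketch of the before/after, not the actual patch):
{code}
// Before (always true, because dataset.rdd returns a fresh, uncached RDD):
//   val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE

// After (reflects whether the caller actually persisted the Dataset):
val handlePersistence = dataset.storageLevel == StorageLevel.NONE

if (handlePersistence) {
  instances.persist(StorageLevel.MEMORY_AND_DISK)
}
{code}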

This message was sent by Atlassian JIRA
