spark-issues mailing list archives

From "Nick Pentreath (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-14489) RegressionEvaluator returns NaN for ALS in Spark ml
Date Tue, 28 Feb 2017 14:18:45 GMT

     [ https://issues.apache.org/jira/browse/SPARK-14489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Pentreath resolved SPARK-14489.
------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0

Issue resolved by pull request 12896
[https://github.com/apache/spark/pull/12896]

> RegressionEvaluator returns NaN for ALS in Spark ml
> ---------------------------------------------------
>
>                 Key: SPARK-14489
>                 URL: https://issues.apache.org/jira/browse/SPARK-14489
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.0
>         Environment: AWS EMR
>            Reporter: Boris Clémençon 
>            Assignee: Nick Pentreath
>              Labels: patch
>             Fix For: 2.2.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> When building a Spark ML pipeline containing an ALS estimator, the RegressionEvaluator metrics "rmse", "mse", "r2" and "mae" all return NaN.
> The cause is in CrossValidator.scala, line 109: the K folds are generated randomly. For large, sparse datasets, there is a significant probability that at least one user in the validation set is missing from the training set. The transform method then produces NaN predictions for those users, and the RegressionEvaluator metrics become NaN as well.
> Suggested fix: remove the NaN values when computing the RMSE or other metrics (i.e., drop users or items in the validation set that are missing from the training set), and log a message when this happens.
> Issue SPARK-14153 appears to describe the same problem.
> {code:title=Bar.scala|borderStyle=solid}
>     val splits = MLUtils.kFold(dataset.rdd, $(numFolds), 0)
>     splits.zipWithIndex.foreach { case ((training, validation), splitIndex) =>
>       val trainingDataset = sqlCtx.createDataFrame(training, schema).cache()
>       val validationDataset = sqlCtx.createDataFrame(validation, schema).cache()
>       // multi-model training
>       logDebug(s"Train split $splitIndex with multiple sets of parameters.")
>       val models = est.fit(trainingDataset, epm).asInstanceOf[Seq[Model[_]]]
>       trainingDataset.unpersist()
>       var i = 0
>       while (i < numModels) {
>         // TODO: duplicate evaluator to take extra params from input
>         val metric = eval.evaluate(models(i).transform(validationDataset, epm(i)))
>         logDebug(s"Got metric $metric for model trained with ${epm(i)}.")
>         metrics(i) += metric
>         i += 1
>       }
>       validationDataset.unpersist()
>     }
> {code}
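
The suggestion in the description (drop NaN predictions before computing the metric, and log when that happens) can be sketched in plain Scala without Spark. This is only an illustration under stated assumptions: the object NanSafeRmse and the helpers rmse and rmseDroppingNaN are hypothetical names, not part of Spark or of pull request 12896, and the (prediction, label) pairs stand in for the rows produced by transform on a validation fold.

```scala
// Hypothetical helper, not Spark API: shows how a single NaN prediction
// poisons RMSE, and how dropping NaN rows (with a log line) avoids it.
object NanSafeRmse {
  // Root-mean-square error over (prediction, label) pairs.
  // Any NaN prediction makes the sum, and therefore the result, NaN.
  def rmse(pairs: Seq[(Double, Double)]): Double =
    math.sqrt(pairs.map { case (p, l) => (p - l) * (p - l) }.sum / pairs.size)

  // Drops pairs whose prediction is NaN, logs how many were dropped,
  // and returns the RMSE of the remaining pairs plus the dropped count.
  // If every prediction is NaN, the result is still NaN (0.0 / 0).
  def rmseDroppingNaN(pairs: Seq[(Double, Double)]): (Double, Int) = {
    val (valid, missing) = pairs.partition { case (p, _) => !p.isNaN }
    if (missing.nonEmpty)
      Console.err.println(
        s"Dropped ${missing.size} of ${pairs.size} rows with NaN predictions.")
    (rmse(valid), missing.size)
  }

  def main(args: Array[String]): Unit = {
    // The second row mimics a user that was absent from the training fold.
    val scored = Seq((3.0, 4.0), (Double.NaN, 2.0), (5.0, 5.0))
    println(s"Plain RMSE is NaN: ${rmse(scored).isNaN}")
    val (value, dropped) = rmseDroppingNaN(scored)
    println(s"RMSE after dropping $dropped row(s): $value")
  }
}
```

Dropping rows changes which examples the metric is computed on, so the number of dropped rows is returned alongside the metric rather than hidden; a fold that loses many rows is probably not comparable to the others.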



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


