spark-dev mailing list archives

From Xiangrui Meng <men...@gmail.com>
Subject Re: [ml] Lost persistence for fold in crossvalidation.
Date Tue, 17 Feb 2015 23:12:21 GMT
There are three different regParams defined in the grid and there are
three folds. For simplicity, we don't split the dataset into three
parts once and reuse them; instead we do the split for each fold. As a
result, we need to cache 3*3 times. Note that the pipeline API is not
yet optimized for performance. It would be nice to optimize its
performance in 1.4.
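
To make the 3*3 count concrete, here is a minimal sketch in plain Scala. It has no Spark dependency and only counts how often a persist would happen under two strategies; the object and method names are hypothetical and are not part of the pipeline API.

```scala
// Illustrative simulation only: no Spark involved, all names are hypothetical.
object CacheCountSketch {
  // Grid and fold count taken from the quoted message below.
  val regParams: Seq[Double] = Seq(0.1, 0.01, 0.001)
  val numFolds: Int = 3

  // Strategy A: every fit() call persists its own training input,
  // so a persist happens once per (fold, regParam) pair.
  def persistsPerFit(): Int = {
    var persists = 0
    for (_ <- 0 until numFolds; _ <- regParams) persists += 1
    persists
  }

  // Strategy B: persist the training input once per fold and
  // reuse it for every regParam in the grid.
  def persistsPerFold(): Int = {
    var persists = 0
    for (_ <- 0 until numFolds) {
      persists += 1              // one persist for this fold
      regParams.foreach(_ => ()) // all fits reuse the persisted input
    }
    persists
  }

  def main(args: Array[String]): Unit = {
    println(s"persist per fit:  ${persistsPerFit()}")  // numFolds * regParams.size
    println(s"persist per fold: ${persistsPerFold()}") // numFolds
  }
}
```

Strategy A corresponds to the 9 reads reported in the quoted message, and Strategy B to the 3 that were expected.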
-Xiangrui

On Wed, Feb 11, 2015 at 11:13 AM, Peter Rudenko <petro.rudenko@gmail.com> wrote:
> Hi, I have a problem using Spark 1.2 with Pipeline + GridSearch +
> LogisticRegression. I’ve reimplemented the LogisticRegression.fit method
> and commented out instances.unpersist():
>
> |override def fit(dataset: SchemaRDD, paramMap: ParamMap): LogisticRegressionModel = {
>     println(s"Fitting dataset ${dataset.take(1000).toSeq.hashCode()} with ParamMap $paramMap.")
>     transformSchema(dataset.schema, paramMap, logging = true)
>     import dataset.sqlContext._
>     val map = this.paramMap ++ paramMap
>     val instances = dataset.select(map(labelCol).attr, map(featuresCol).attr)
>       .map {
>         case Row(label: Double, features: Vector) =>
>           LabeledPoint(label, features)
>       }
>
>     if (instances.getStorageLevel == StorageLevel.NONE) {
>       println("Instances not persisted")
>       instances.persist(StorageLevel.MEMORY_AND_DISK)
>     }
>
>     val lr = (new LogisticRegressionWithLBFGS)
>       .setValidateData(false)
>       .setIntercept(true)
>     lr.optimizer
>       .setRegParam(map(regParam))
>       .setNumIterations(map(maxIter))
>     val lrm = new LogisticRegressionModel(this, map, lr.run(instances).weights)
>     //instances.unpersist()
>     // copy model params
>     Params.inheritValues(map, this, lrm)
>     lrm
>   }
> |
>
> CrossValidator feeds the same SchemaRDD for each parameter (same hash code),
> but somewhere the cache is being flushed. There is enough memory. Here’s the
> output:
>
> |Fitting dataset 2051470010 with ParamMap {
>     DRLogisticRegression-f35ae4d3-regParam: 0.1
> }.
> Instances not persisted
> Fitting dataset 2051470010 with ParamMap {
>     DRLogisticRegression-f35ae4d3-regParam: 0.01
> }.
> Instances not persisted
> Fitting dataset 2051470010 with ParamMap {
>     DRLogisticRegression-f35ae4d3-regParam: 0.001
> }.
> Instances not persisted
> Fitting dataset 802615223 with ParamMap {
>     DRLogisticRegression-f35ae4d3-regParam: 0.1
> }.
> Instances not persisted
> Fitting dataset 802615223 with ParamMap {
>     DRLogisticRegression-f35ae4d3-regParam: 0.01
> }.
> Instances not persisted
> |
>
> I have 3 parameters in GridSearch and 3 folds for CrossValidation:
>
> |
> val paramGrid = new ParamGridBuilder()
>   .addGrid(model.regParam, Array(0.1, 0.01, 0.001))
>   .build()
>
> crossval.setEstimatorParamMaps(paramGrid)
> crossval.setNumFolds(3)
> |
>
> I assume that the data should be read and cached 3 times, i.e.
> (1 to numFolds).combinations(2), independent of the number of parameters.
> But the data is being read and cached 9 times.
>
> Thanks,
> Peter Rudenko
>
>


