spark-user mailing list archives

From Devin Jones <devin.jo...@columbia.edu>
Subject Re: Spark ML - Scaling logistic regression for many features
Date Mon, 07 Mar 2016 21:31:50 GMT
Hi,

Which data structure are you using to train the model? If you haven't tried
yet, you should consider the SparseVector

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector

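For reference, a sparse vector stores only the non-zero indices and their values, so a 20-million-dimensional vector with only tens of thousands of active features stays small in memory. A minimal sketch (the sizes and indices below are made up for illustration):

```scala
import org.apache.spark.mllib.linalg.Vectors

// A 20-million-dimensional vector with only three non-zero entries.
// Indices must be strictly increasing; values align by position.
val sv = Vectors.sparse(20000000, Array(3, 1000, 19999999), Array(1.0, 0.5, 2.0))

sv.size        // 20000000 - the logical dimensionality
sv(1000)       // 0.5      - lookup by index
sv.numActives  // 3        - only the stored entries cost memory
```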

On Mon, Mar 7, 2016 at 4:03 PM, Daniel Siegmann <daniel.siegmann@teamaol.com
> wrote:

> I recently tried to train a model using
> org.apache.spark.ml.classification.LogisticRegression on a data set where
> the feature vector size was around ~20 million. It did *not* go well. It
> took around 10 hours to train on a substantial cluster. Additionally, it
> pulled a lot of data back to the driver - I eventually set --conf
> spark.driver.memory=128g --conf spark.driver.maxResultSize=112g when
> submitting.
>
> Attempting the same application on the same cluster with the feature
> vector size reduced to 100k took only ~9 minutes. Clearly there is an
> issue with scaling to large numbers of features. I'm not doing anything
> fancy in my app, here's the relevant code:
>
> val lr = new LogisticRegression().setRegParam(1)
> val model = lr.fit(trainingSet.toDF())
>
> In comparison, a coworker trained a logistic regression model on her
> *laptop* using the Java library liblinear in just a few minutes. That's
> with the ~20 million-sized feature vectors. This suggests to me there is
> some issue with Spark ML's implementation of logistic regression which is
> limiting its scalability.
>
> Note that my feature vectors are *very* sparse. The maximum feature index is
> around 20 million, but I think there are only tens of thousands of features.
>
> Has anyone run into this? Any idea where the bottleneck is or how this
> problem might be solved?
>
> One solution of course is to implement some dimensionality reduction. I'd
> really like to avoid this, as it's just another thing to deal with - not so
> hard to put it into the trainer, but then anything doing scoring will need
> the same logic. Unless Spark ML supports this out of the box - an easy way
> to save / load a model along with the dimensionality-reduction logic, so
> that when transform is called on the model it handles the reduction
> transparently?
>
> Any advice would be appreciated.
>
> ~Daniel Siegmann
>
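Regarding the save/load question in the quoted message: a Spark ML Pipeline can bundle a feature transformer together with the model, so that a single fitted PipelineModel handles both transformation and scoring, and persists as one unit. Below is one possible sketch using the hashing trick (HashingTF) as the dimensionality reducer - the column names, bucket count, and toy data are illustrative, and this assumes a Spark 2.x-style SparkSession API:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
import spark.implicits._

// Tiny stand-in for the real training set: a label plus raw feature tokens.
val training = Seq(
  (1.0, Seq("f1", "f7", "f19999999")),
  (0.0, Seq("f2", "f9"))
).toDF("label", "rawFeatures")

// Hash the ~20M-dimensional token space down to ~260k buckets.
val hasher = new HashingTF()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .setNumFeatures(1 << 18)

val lr = new LogisticRegression().setRegParam(1)

// The fitted PipelineModel contains both the hasher and the trained model.
val pipeline = new Pipeline().setStages(Array(hasher, lr))
val model = pipeline.fit(training)

// transform() applies hashing and scoring together, and write()/load()
// persist both stages, so scoring code needs no separate reduction logic.
model.transform(training).select("label", "prediction").show()
```

Whether the hashing trick is acceptable depends on how much collision-induced accuracy loss the application can tolerate; the point here is only that the Pipeline abstraction keeps the reduction and the model together through save, load, and transform.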
