spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Devin Jones <>
Subject Re: Spark ML - Scaling logistic regression for many features
Date Mon, 07 Mar 2016 21:31:50 GMT

Which data structure are you using to train the model? If you haven't tried
yet, you should consider the SparseVector

On Mon, Mar 7, 2016 at 4:03 PM, Daniel Siegmann <
> wrote:

> I recently tried to a model using
> on a data set where
> the feature vector size was around ~20 million. It did *not* go well. It
> took around 10 hours to train on a substantial cluster. Additionally, it
> pulled a lot data back to the driver - I eventually set --conf
> spark.driver.memory=128g --conf spark.driver.maxResultSize=112g when
> submitting.
> Attempting the same application on the same cluster with the feature
> vector size reduced to 100k took only ~ 9 minutes. Clearly there is an
> issue with scaling to large numbers of features. I'm not doing anything
> fancy in my app, here's the relevant code:
> val lr = new LogisticRegression().setRegParam(1)
> val model =
> In comparison, a coworker trained a logistic regression model on her
> *laptop* using the Java library liblinear in just a few minutes. That's
> with the ~20 million-sized feature vectors. This suggests to me there is
> some issue with Spark ML's implementation of logistic regression which is
> limiting its scalability.
> Note that my feature vectors are *very* sparse. The maximum feature is
> around 20 million, but I think there are only 10's of thousands of features.
> Has anyone run into this? Any idea where the bottleneck is or how this
> problem might be solved?
> One solution of course is to implement some dimensionality reduction. I'd
> really like to avoid this, as it's just another thing to deal with - not so
> hard to put it into the trainer, but then anything doing scoring will need
> the same logic. Unless Spark ML supports this out of the box? An easy way
> to save / load a model along with the dimensionality reduction logic so
> when transform is called on the model it will handle the dimensionality
> reduction transparently?
> Any advice would be appreciated.
> ~Daniel Siegmann

View raw message