spark-user mailing list archives

From Devin Jones
Subject Re: Spark ML - Scaling logistic regression for many features
Date Mon, 07 Mar 2016 22:04:51 GMT
I could be wrong, but it's possible that toDF populates a DataFrame, which,
as I understand it, does not support SparseVectors at the moment.

If you use the MLlib logistic regression implementation (not ml), you can
pass the RDD[LabeledPoint] data type directly to the learner.

The only downside is that you can't use the pipeline framework from Spark ML.
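
To make the suggestion concrete, here is a minimal sketch of training on an RDD[LabeledPoint] of sparse vectors with MLlib's LogisticRegressionWithLBFGS. The object name, the toy rows, the local master and the small numFeatures value are all assumptions for illustration, not anything from the original application, and I have not tuned regularization to match the ml code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical sketch object; names and data are illustrative only.
object MllibSparseLrSketch {
  // Assumed vector size for the toy data (the thread's real size is ~20 million).
  val numFeatures = 1000

  // Toy training set: each row stores only its non-zero indices and values.
  def toyData(sc: SparkContext): RDD[LabeledPoint] = sc.parallelize(Seq(
    LabeledPoint(1.0, Vectors.sparse(numFeatures, Array(3, 999), Array(1.0, 1.0))),
    LabeledPoint(1.0, Vectors.sparse(numFeatures, Array(3), Array(1.0))),
    LabeledPoint(0.0, Vectors.sparse(numFeatures, Array(7), Array(1.0))),
    LabeledPoint(0.0, Vectors.sparse(numFeatures, Array(7, 500), Array(1.0, 1.0)))
  ))

  // The MLlib learner takes the RDD[LabeledPoint] directly; no DataFrame is involved.
  def train(data: RDD[LabeledPoint]): LogisticRegressionModel =
    new LogisticRegressionWithLBFGS().setNumClasses(2).run(data)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("mllib-sparse-lr-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val model = train(toyData(sc))
      // Score a single sparse vector; the result is a class label as a Double.
      println(model.predict(Vectors.sparse(numFeatures, Array(3), Array(1.0))))
    } finally {
      sc.stop()
    }
  }
}
```

The resulting LogisticRegressionModel is a plain MLlib model, so scoring goes through model.predict rather than a pipeline transform.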


On Mon, Mar 7, 2016 at 4:54 PM, Daniel Siegmann wrote:

> Yes, it is a SparseVector. Most rows only have a few features, and all
> the rows together only have tens of thousands of features, but the vector
> size is ~20 million because that is the largest feature index.
> On Mon, Mar 7, 2016 at 4:31 PM, Devin Jones wrote:
>> Hi,
>> Which data structure are you using to train the model? If you haven't
>> tried it yet, you should consider the SparseVector.
>> On Mon, Mar 7, 2016 at 4:03 PM, Daniel Siegmann wrote:
>>> I recently tried to train a model using Spark ML's logistic regression
>>> on a data set where the feature vector size was around 20 million. It did
>>> *not* go well. It took around 10 hours to train on a substantial cluster.
>>> Additionally, it pulled a lot of data back to the driver - I eventually set
>>> --conf spark.driver.memory=128g --conf spark.driver.maxResultSize=112g when
>>> submitting.
>>> Attempting the same application on the same cluster with the feature
>>> vector size reduced to 100k took only ~9 minutes. Clearly there is an
>>> issue with scaling to large numbers of features. I'm not doing anything
>>> fancy in my app; here's the relevant code:
>>> val lr = new LogisticRegression().setRegParam(1)
>>> val model =
>>> In comparison, a coworker trained a logistic regression model on her
>>> *laptop* using the Java library liblinear in just a few minutes. That's
>>> with the ~20 million-sized feature vectors. This suggests to me there is
>>> some issue with Spark ML's implementation of logistic regression which is
>>> limiting its scalability.
>>> Note that my feature vectors are *very* sparse. The maximum feature index
>>> is around 20 million, but I think there are only tens of thousands of
>>> features.
>>> Has anyone run into this? Any idea where the bottleneck is or how this
>>> problem might be solved?
>>> One solution of course is to implement some dimensionality reduction.
>>> I'd really like to avoid this, as it's just another thing to deal with -
>>> not so hard to put it into the trainer, but then anything doing scoring
>>> will need the same logic. Unless Spark ML supports this out of the box? An
>>> easy way to save / load a model along with the dimensionality reduction
>>> logic so when transform is called on the model it will handle the
>>> dimensionality reduction transparently?
>>> Any advice would be appreciated.
>>> ~Daniel Siegmann
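
On the dimensionality reduction idea in the quoted message: since only tens of thousands of distinct features occur, one simple form of reduction is to remap the feature indices that actually appear onto a compact range. The following is a rough sketch in plain Scala with no Spark dependency; SparseRow, buildIndexMap and compact are hypothetical names, and nothing here is a built-in Spark ML feature. As the original message points out, the same index map would have to be saved and applied again at scoring time.

```scala
// Hypothetical helper, not part of Spark: remap sparse feature indices
// onto a compact 0..k-1 range, where k is the number of distinct indices seen.
object FeatureCompaction {
  /** A sparse row as parallel arrays of original feature indices and values. */
  final case class SparseRow(indices: Array[Int], values: Array[Double])

  /** Map every original index that occurs in the data to a compact index. */
  def buildIndexMap(rows: Seq[SparseRow]): Map[Int, Int] =
    rows.flatMap(_.indices).distinct.sorted.zipWithIndex.toMap

  /** Rewrite one row into the compact index space.
    * Indices missing from the map (unseen at training time) are dropped.
    */
  def compact(row: SparseRow, indexMap: Map[Int, Int]): SparseRow = {
    val kept = row.indices.zip(row.values).collect {
      case (i, v) if indexMap.contains(i) => (indexMap(i), v)
    }.sortBy(_._1)
    SparseRow(kept.map(_._1), kept.map(_._2))
  }
}
```

With this, the vector size passed to the learner becomes indexMap.size instead of the largest raw index plus one.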
