spark-user mailing list archives

From Daniel Siegmann <daniel.siegm...@teamaol.com>
Subject Re: Spark ML - Scaling logistic regression for many features
Date Mon, 07 Mar 2016 21:54:31 GMT
Yes, it is a SparseVector. Most rows only have a few features, and all the
rows together only have tens of thousands of features, but the vector size
is ~20 million because that is the largest feature index.
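
To make that concrete, here is a rough sketch of how one row is built (the
indices and values below are made up purely for illustration):

import org.apache.spark.mllib.linalg.Vectors

// The vector size has to cover the largest feature index (~20 million),
// even though only a handful of entries per row are non-zero.
val numFeatures = 20000000
val indices = Array(3, 1048576, 19999999)   // example indices, sorted ascending
val values = Array(1.0, 1.0, 1.0)
val features = Vectors.sparse(numFeatures, indices, values)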

On Mon, Mar 7, 2016 at 4:31 PM, Devin Jones <devin.jones@columbia.edu>
wrote:

> Hi,
>
> Which data structure are you using to train the model? If you haven't
> tried yet, you should consider the SparseVector
>
>
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector
>
>
> On Mon, Mar 7, 2016 at 4:03 PM, Daniel Siegmann <
> daniel.siegmann@teamaol.com> wrote:
>
>> I recently tried to train a model using
>> org.apache.spark.ml.classification.LogisticRegression on a data set
>> where the feature vector size was around 20 million. It did *not* go
>> well. It took around 10 hours to train on a substantial cluster.
>> Additionally, it pulled a lot of data back to the driver - I eventually set
>> --conf spark.driver.memory=128g --conf spark.driver.maxResultSize=112g
>> when submitting.
>>
>> Attempting the same application on the same cluster with the feature
>> vector size reduced to 100k took only ~ 9 minutes. Clearly there is an
>> issue with scaling to large numbers of features. I'm not doing anything
>> fancy in my app, here's the relevant code:
>>
>> val lr = new LogisticRegression().setRegParam(1)
>> val model = lr.fit(trainingSet.toDF())
>>
>> In comparison, a coworker trained a logistic regression model on her
>> *laptop* using the Java library liblinear in just a few minutes. That's
>> with the ~20 million-sized feature vectors. This suggests to me there is
>> some issue with Spark ML's implementation of logistic regression which is
>> limiting its scalability.
>>
>> Note that my feature vectors are *very* sparse. The maximum feature index
>> is around 20 million, but I think there are only tens of thousands of
>> distinct features.
>>
>> Has anyone run into this? Any idea where the bottleneck is or how this
>> problem might be solved?
>>
>> One solution of course is to implement some dimensionality reduction. I'd
>> really like to avoid this, as it's just another thing to deal with - it's
>> not so hard to put into the trainer, but then anything doing scoring will
>> need the same logic. Unless Spark ML supports this out of the box? Is there
>> an easy way to save / load a model along with the dimensionality reduction
>> logic, so that when transform is called on the model it handles the
>> dimensionality reduction transparently?
>>
>> Any advice would be appreciated.
>>
>> ~Daniel Siegmann
>>
>
>
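
To make the dimensionality reduction question above concrete, what I'd hope
for is something like a Pipeline, where the fitted PipelineModel applies the
reduction step automatically when transform is called. A rough sketch, using
ChiSqSelector purely as a stand-in for whatever reduction we'd actually use
(the column names and the 100k target size are made up):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.ChiSqSelector

// Keep the reduction stage together with the model, so scoring code only
// ever calls model.transform(...) and gets both steps applied.
val selector = new ChiSqSelector()
  .setNumTopFeatures(100000)          // made-up target dimensionality
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

val lr = new LogisticRegression()
  .setRegParam(1)
  .setFeaturesCol("selectedFeatures")

val pipeline = new Pipeline().setStages(Array(selector, lr))
val model = pipeline.fit(trainingSet.toDF())

Whether the fitted pipeline (including the selector stage) can be saved and
reloaded seems to depend on the Spark version, so that part is still an open
question for me.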
