spark-user mailing list archives

From Michał Zieliński <zielinski.mich...@gmail.com>
Subject Re: Spark ML - Scaling logistic regression for many features
Date Mon, 07 Mar 2016 22:35:27 GMT
We're using SparseVector columns in a DataFrame, so they are definitely
supported. But maybe for LR some implicit magic is happening inside.

On 7 March 2016 at 23:04, Devin Jones <devin.jones@columbia.edu> wrote:

> I could be wrong, but it's possible that toDF populates a DataFrame, which
> I understand does not support SparseVectors at the moment.
>
> If you use the MLlib logistic regression implementation (not ML) you can
> pass the RDD[LabeledPoint] data type directly to the learner.
>
>
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
>
> Only downside is that you can't use the pipeline framework from spark ml.
>
> Cheers,
> Devin
>
>
>
> On Mon, Mar 7, 2016 at 4:54 PM, Daniel Siegmann <
> daniel.siegmann@teamaol.com> wrote:
>
>> Yes, it is a SparseVector. Most rows only have a few features, and all
>> the rows together only have tens of thousands of features, but the vector
>> size is ~20 million because that is the largest feature index.
>>
>> On Mon, Mar 7, 2016 at 4:31 PM, Devin Jones <devin.jones@columbia.edu>
>> wrote:
>>
>>> Hi,
>>>
>>> Which data structure are you using to train the model? If you haven't
>>> tried it yet, you should consider SparseVector:
>>>
>>>
>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector
>>>
>>>
>>> On Mon, Mar 7, 2016 at 4:03 PM, Daniel Siegmann <
>>> daniel.siegmann@teamaol.com> wrote:
>>>
>>>> I recently tried to train a model using
>>>> org.apache.spark.ml.classification.LogisticRegression on a data set
>>>> where the feature vector size was ~20 million. It did *not* go
>>>> well. It took around 10 hours to train on a substantial cluster.
>>>> Additionally, it pulled a lot of data back to the driver - I eventually
>>>> set --conf spark.driver.memory=128g --conf
>>>> spark.driver.maxResultSize=112g when submitting.
>>>>
>>>> Attempting the same application on the same cluster with the feature
>>>> vector size reduced to 100k took only ~ 9 minutes. Clearly there is an
>>>> issue with scaling to large numbers of features. I'm not doing anything
>>>> fancy in my app, here's the relevant code:
>>>>
>>>> import org.apache.spark.ml.classification.LogisticRegression
>>>> val lr = new LogisticRegression().setRegParam(1)
>>>> val model = lr.fit(trainingSet.toDF())
>>>>
>>>> In comparison, a coworker trained a logistic regression model on her
>>>> *laptop* using the Java library liblinear in just a few minutes.
>>>> That's with the ~20 million-sized feature vectors. This suggests to me
>>>> there is some issue with Spark ML's implementation of logistic regression
>>>> which is limiting its scalability.
>>>>
>>>> Note that my feature vectors are *very* sparse. The maximum feature
>>>> index is around 20 million, but I think there are only tens of thousands
>>>> of distinct features.
>>>>
>>>> Has anyone run into this? Any idea where the bottleneck is or how this
>>>> problem might be solved?
>>>>
>>>> One solution of course is to implement some dimensionality reduction.
>>>> I'd really like to avoid this, as it's just another thing to deal with -
>>>> not so hard to put it into the trainer, but then anything doing scoring
>>>> will need the same logic. Or does Spark ML support this out of the box,
>>>> e.g. an easy way to save / load a model along with the dimensionality
>>>> reduction logic, so that when transform is called on the model it
>>>> handles the dimensionality reduction transparently?
>>>>
>>>> Any advice would be appreciated.
>>>>
>>>> ~Daniel Siegmann
>>>>
>>>
>>>
>>
>
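
The original message describes vectors of size ~20 million in which only
tens of thousands of feature indices actually occur. A minimal sketch of
one simple way to shrink the vector size, assuming that remapping the
observed indices onto a compact range is acceptable: it is plain Scala
with no Spark dependency, and IndexCompaction, SparseRow, buildIndexMap
and compact are hypothetical names, not part of Spark. This is a form of
the preprocessing the original poster hoped to avoid, and the thread does
not establish whether it removes the bottleneck.

```scala
// Hypothetical helper, not part of Spark: remaps the sparse feature indices
// that actually occur onto the compact range 0 until (number of distinct indices).
object IndexCompaction {
  // One sparse row as parallel arrays of original feature indices and values.
  final case class SparseRow(indices: Array[Int], values: Array[Double])

  // Map each observed original index to a compact index,
  // assigned in ascending order of the original index.
  def buildIndexMap(rows: Seq[SparseRow]): Map[Int, Int] =
    rows.flatMap(_.indices).distinct.sorted.zipWithIndex.toMap

  // Rewrite one row with compact indices, sorted ascending.
  // Indices missing from indexMap (unseen at training time) are dropped.
  def compact(row: SparseRow, indexMap: Map[Int, Int]): SparseRow = {
    val kept = row.indices.zip(row.values).collect {
      case (i, v) if indexMap.contains(i) => (indexMap(i), v)
    }.sortBy(_._1)
    SparseRow(kept.map(_._1), kept.map(_._2))
  }
}
```

After compaction the vector size would be indexMap.size rather than the
largest original index plus one. The same indexMap has to be stored and
applied wherever the model is used for scoring, which is the extra
bookkeeping mentioned in the original message.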
