We're using SparseVector columns in a DataFrame, so they are definitely supported. But maybe for LR some implicit magic is happening inside.

On 7 March 2016 at 23:04, Devin Jones <devin.jones@columbia.edu> wrote:
I could be wrong but its possible that toDF populates a dataframe which I understand do not support sparsevectors at the moment. 

If you use the MlLib logistic regression implementation (not ml) you can pass the RDD[LabeledPoint] data type directly to the learner. 


Only downside is that you can't use the pipeline framework from spark ml. 


On Mon, Mar 7, 2016 at 4:54 PM, Daniel Siegmann <daniel.siegmann@teamaol.com> wrote:
Yes, it is a SparseVector. Most rows only have a few features, and all the rows together only have tens of thousands of features, but the vector size is ~ 20 million because that is the largest feature.

On Mon, Mar 7, 2016 at 4:31 PM, Devin Jones <devin.jones@columbia.edu> wrote:

Which data structure are you using to train the model? If you haven't tried yet, you should consider the SparseVector

On Mon, Mar 7, 2016 at 4:03 PM, Daniel Siegmann <daniel.siegmann@teamaol.com> wrote:
I recently tried to a model using org.apache.spark.ml.classification.LogisticRegression on a data set where the feature vector size was around ~20 million. It did not go well. It took around 10 hours to train on a substantial cluster. Additionally, it pulled a lot data back to the driver - I eventually set --conf spark.driver.memory=128g --conf spark.driver.maxResultSize=112g when submitting.

Attempting the same application on the same cluster with the feature vector size reduced to 100k took only ~ 9 minutes. Clearly there is an issue with scaling to large numbers of features. I'm not doing anything fancy in my app, here's the relevant code:
val lr = new LogisticRegression().setRegParam(1)
val model = lr.fit(trainingSet.toDF())
In comparison, a coworker trained a logistic regression model on her laptop using the Java library liblinear in just a few minutes. That's with the ~20 million-sized feature vectors. This suggests to me there is some issue with Spark ML's implementation of logistic regression which is limiting its scalability.

Note that my feature vectors are very sparse. The maximum feature is around 20 million, but I think there are only 10's of thousands of features.

Has anyone run into this? Any idea where the bottleneck is or how this problem might be solved?

One solution of course is to implement some dimensionality reduction. I'd really like to avoid this, as it's just another thing to deal with - not so hard to put it into the trainer, but then anything doing scoring will need the same logic. Unless Spark ML supports this out of the box? An easy way to save / load a model along with the dimensionality reduction logic so when transform is called on the model it will handle the dimensionality reduction transparently?

Any advice would be appreciated.

~Daniel Siegmann