spark-user mailing list archives

From: George Paulson
Subject: Spark ML/MLib newbie question
Date: Mon, 19 Oct 2015 21:02:50 GMT

I have a dataset that's relatively big but easily fits in memory. I want to
generate many different features for this dataset and then run
L1-regularized logistic regression on the feature-enhanced dataset.

The combined features will easily exhaust memory. I was hoping there was a
way that I could generate the features on the fly for stochastic gradient
descent. That is, every time the SGD routine samples from the original
dataset it will calculate the new features and use those as the input.
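
To make the idea concrete, here is a minimal sketch in plain Python (not Spark,
and not MLlib's GradientDescent). It assumes a hypothetical feature map
`expand` that adds a bias term and all pairwise products, and it handles the L1
penalty with a proximal soft-threshold step, which is only one common choice.
All names and hyperparameters are illustrative assumptions.

```python
import math
import random


def expand(x):
    """Hypothetical feature map: bias, raw values, and pairwise products."""
    feats = [1.0] + list(x)
    n = len(x)
    for i in range(n):
        for j in range(i, n):
            feats.append(x[i] * x[j])
    return feats


def soft_threshold(w, t):
    """Proximal step for the L1 penalty: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def sgd_l1_logistic(rows, labels, epochs=20, lr=0.1, l1=0.001, seed=0):
    """SGD for L1-regularized logistic regression with lazy feature expansion.

    Only the original rows are kept in memory; the expanded features of a
    sample exist only for the duration of its own update step.
    """
    rng = random.Random(seed)
    dim = len(expand(rows[0]))
    w = [0.0] * dim
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            phi = expand(rows[i])  # computed on the fly, discarded afterwards
            z = sum(wk * fk for wk, fk in zip(w, phi))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - labels[i]  # gradient of the log loss with respect to z
            # Gradient step, then L1 shrinkage (for simplicity the bias
            # weight is shrunk as well).
            w = [soft_threshold(wk - lr * g * fk, lr * l1)
                 for wk, fk in zip(w, phi)]
    return w


def predict(w, x):
    """Classify a raw (unexpanded) row with the learned weights."""
    z = sum(wk * fk for wk, fk in zip(w, expand(x)))
    return 1 if z > 0 else 0
```

The point of the sketch is that memory use is bounded by the original rows plus
one expanded sample at a time; whether the existing library can be made to
behave this way is exactly what I am asking.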

With Spark ML it seems you can define transformations and add them to your
pipeline, which would work if the transformed data all fit into memory. But is
it possible to do something like what I'm proposing, a sort of lazy evaluation
within the current library? Or do I need to change GradientDescent.scala
myself for this to work?
