spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jacek Laskowski <>
Subject Re: MLLib + Streaming
Date Sun, 06 Mar 2016 08:27:25 GMT

Thanks a lot, Guru Medasani, for such an excellent theory rich intro to
MLlib! I wish I found only such emails in my mailbox.

Sorry. Couldn't resist since I've just started with MLlib and your response
has resonated so well with my initial experience.


06.03.2016 6:55 AM "Guru Medasani" <> napisał(a):

> Hi Lan,
> Streaming Means, Linear Regression and Logistic Regression support online
> machine learning as you mentioned. Online machine learning is where model
> is being trained and updated on every batch of streaming data. These models
> have trainOn() and predictOn() methods where you can simply pass in
> DStreams you want to train the model on and DStreams you want the model to
> predict on. So when the next batch of data arrives model is trained and
> updated again. In this case model weights are continually updated and
> hopefully model performs better in terms of convergence and accuracy over
> time. What we are really trying to do in online learning case is that we
> are only showing few examples of the data at a time ( stream of data) and
> updating the parameters in case of Linear and Logistic Regression and
> updating the centers in case of K-Means. In the case of Linear or Logistic
> Regression this is possible due to the optimizer that is chosen for
> minimizing the cost function which is Stochastic Gradient Descent. This
> optimizer helps us to move closer and closer to the optimal weights after
> every batch and over the time we will have a model that has learned how to
> represent our data and predict well.
> In the scenario of using any MLlib algorithms and doing training with
> DStream.transform() and DStream.foreachRDD() operations, when the first
> batch of data arrives we build a model, let’s call this model1. Once you
> have the model1 you can make predictions on the same DStream or a different
> DStream source. But for the next batch if you follow the same procedure and
> create a model, let’s call this model2. This model2 will be significantly
> different than model1 based on how different the data is in the second
> DStream vs the first DStream as it is not continually updating the model.
> It’s like weight vectors are jumping from one place to the other for every
> batch and we never know if the algorithm is converging to the optimal
> weights. So I believe it is not possible to do true online learning with
> other MLLib models in Spark Streaming.  I am not sure if this is because
> the models don’t generally support this streaming scenarios or if the
> streaming versions simply haven’t been implemented yet.
> Though technically you can use any of the MLlib algorithms in Spark
> Streaming with the procedure you mentioned and make predictions, it is
> important to figure out if the model you are choosing can converge by
> showing only a subset(batches  - DStreams) of the data over time. Based on
> the algorithm you choose certain optimizers won’t necessarily be able to
> converge by showing only individual data points and require to see majority
> of the data to be able to learn optimal weights.  In these cases, you can
> still do offline learning/training with Spark bach processing using any of
> the MLlib algorithms and save those models on hdfs. You can then start a
> streaming job and load these saved models into your streaming application
> and make predictions. This is traditional offline learning.
> In general, online learning is hard as it’s hard to evaluate since we are
> not holding any test data during the model training. We are simply training
> the model and predicting. So in the initial batches, results can vary quite
> a bit and have significant errors in terms of the predictions. So choosing
> online learning vs. offline learning depends on how much tolerance the
> application can have towards wild predictions in the beginning. Offline
> training is simple and cheap where as online training can be hard and needs
> to be constantly monitored to see how it is performing.
> Hope this helps in understanding offline learning vs. online learning and
> which algorithms you can choose for online learning in MLlib.
> Guru Medasani
> > On Mar 5, 2016, at 7:37 PM, Lan Jiang <> wrote:
> >
> > Hi, there
> >
> > I hope someone can clarify this for me.  It seems that some of the MLlib
> algorithms such as KMean, Linear Regression and Logistics Regression have a
> Streaming version, which can do online machine learning. But does that mean
> other MLLib algorithm cannot be used in Spark streaming applications, such
> as random forest, SVM, collaborate filtering, etc??
> >
> > DStreams are essentially a sequence of RDDs. We can use
> DStream.transform() and DStream.foreachRDD() operations, which allows you
> access RDDs in a DStream and apply MLLib functions on them. So it looks
> like all MLLib algorithms should be able to run in the streaming
> application. Am I wrong?
> >
> > Lan
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail:
> > For additional commands, e-mail:
> >
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

View raw message