spark-user mailing list archives

From Xiangrui Meng
Subject Re: is spark a good fit for sequential machine learning algorithms?
Date Tue, 04 Nov 2014 05:19:40 GMT
Many ML algorithms are sequential because they were not designed to be
parallel. However, ML is not driven by algorithms in practice, but by
data and applications. As datasets get bigger and bigger, some
algorithms have been revised to work in parallel, like SGD and matrix
factorization. MLlib tries to implement those scalable algorithms that
can handle large-scale datasets.

That being said, even with sequential ML algorithms, Spark is helpful,
because in practice we need to test multiple sets of parameters and
select the best one. Though each algorithm is sequential, training
many models with different parameters is embarrassingly parallel. We
can broadcast the whole dataset (assuming it fits in memory on each
node), and then train model 1 on node 1, model 2 on node 2, etc. Cross
validation also falls into this category.
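
A minimal sketch of that pattern in PySpark, in case it helps. The toy
1-D least-squares SGD and the names train_sgd, mean_squared_error, and
grid_search_on_spark are made up for illustration and are not part of
MLlib. It assumes an existing SparkContext `sc` and a dataset small
enough to broadcast.

```python
import random


def train_sgd(data, step_size, epochs=20, seed=0):
    """Purely sequential SGD for the toy model y ~ w * x; returns w."""
    rng = random.Random(seed)
    rows = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(rows)
        for x, y in rows:
            # Gradient of (w * x - y)^2 with respect to w.
            w -= step_size * 2.0 * (w * x - y) * x
    return w


def mean_squared_error(w, data):
    """Average squared error of the model w on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def grid_search_on_spark(sc, train, valid, step_sizes):
    """Train one model per step size in parallel.

    Returns the (validation_mse, step_size, w) tuple with the lowest error.
    """
    # Ship the full datasets to every executor once.
    train_bc = sc.broadcast(train)
    valid_bc = sc.broadcast(valid)

    def fit_and_score(step):
        # Each task runs the sequential algorithm on its own parameter.
        w = train_sgd(train_bc.value, step)
        return (mean_squared_error(w, valid_bc.value), step, w)

    # One partition per parameter setting, so each model trains in its own task.
    results = (
        sc.parallelize(step_sizes, len(step_sizes))
        .map(fit_and_score)
        .collect()
    )
    return min(results)

# Usage, assuming a running SparkContext `sc` and lists of (x, y) pairs:
#   best_mse, best_step, best_w = grid_search_on_spark(
#       sc, train, valid, [0.001, 0.01, 0.1])
```

The same shape should work for cross validation: parallelize over
(parameter, fold) pairs instead of parameters alone.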


On Mon, Nov 3, 2014 at 1:55 PM, ll wrote:
> i'm struggling with implementing a few algorithms with spark.  hope to get
> help from the community.
> most of the machine learning algorithms today are "sequential", while spark
> is all about "parallelism".  it seems to me that using spark doesn't
> actually help much, because in most cases you can't really parallelize a
> sequential algorithm.
> there must be some strong reasons why mllib was created and so many people
> claim spark is ideal for machine learning.
> what are those reasons?
> what are some specific examples when & how to use spark to implement
> "sequential" machine learning algorithms?
> any comment/feedback/answer is much appreciated.
> thanks!