mahout-user mailing list archives

From Mahesh Balija <balijamahesh....@gmail.com>
Subject Re: Mahout Vs Spark
Date Wed, 22 Oct 2014 19:16:04 GMT
Hi Dmitriy,

My apologies if I conveyed my questions incorrectly.
Also, my intention is definitely NOT to argue.

I have experience with Mahout, and I am also working on some content to make
Mahout simpler, which is why I needed these clarifications. I am also
validating both frameworks and just wanted to get some input from the
active contributors.

Best!
Mahesh Balija.






On Wed, Oct 22, 2014 at 6:57 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

> For the record, this is all a false dilemma (at least w.r.t. Spark vs. the
> Mahout Spark bindings).
>
> The Spark bindings were never conceived as one vs. the other.
>
> The Mahout Scala bindings are an add-on on top of Spark that just happens to
> rely on some of the things in mahout-math.
>
> With Spark one gets some major things: RDDs, MLlib, Spark SQL and
> GraphX.
>
> Guess what, in Spark bindings one still gets all of those wonderful things,
> plus the bindings and bindings shell.
>
> Most of the added value in the Spark bindings is the R-like notation for the
> algebra and the distributed algebraic optimizer.  Of course there are all those
> wonderful distributed decompositions and PCA things, naive Bayes and I think
> some of the co-occurrence stuff too. (The implicit ALS work for Spark was never
> committed, sadly; it is available on a PR branch only.) Internally, my company
> has built several times more methodology code on the Spark bindings than the
> Spark bindings have on their own.
>
> The Spark bindings are also 100% Scala. The only thing that is non-Scala (at
> runtime) is the in-memory Colt-derived matrix model, which is adapted to the
> R-like DSL with Scala bindings. Oh well, can't have it all.
>
> Bottom line, for the most part I feel you are building a straw man argument
> here. You are presenting the problem as a constrained choice with
> inevitable loss, whereas there has never been a loss of choice. Even for
> the sake of algebraic decompositions and optimizations alone I feel there's
> significant added value. (Of course, again, this is only relevant to the
> bindings stuff, not the 0.9 MR stuff, all of which is now deprecated.)
>
> The only two problems I see are these. (1) Mahout pulls in too many legacy
> dependencies that are hard to sort through if one is using it strictly in
> Spark-based apps. Too many things to sort through and throw away in that tree.
> I actually use an opt-in approach (that is, I remove all transitive
> dependencies by default and only add them back one by one if there's an actual
> runtime dependency). This is something that could, and should, be improved
> incrementally.
>
> (2) The second, design problem is that Mahout may be a bit of a problem to use
> alongside other on-top-of-Spark systems because it takes over some things
> in Spark (e.g. it requires things to work with Kryo). But this is more of
> a limitation of Spark itself.
>
>
> But speaking of "survival" and "popularity" concerns, which are very valid
> themselves, I think the major problem with Mahout is none of these alleged
> "vs." things. Strictly IMO, it is that, being an ML project, unlike all those
> other wonderful things, it is not widely backed by any major university or
> academic community.  It has never been. And at this point it would seem it
> will never be. As such, unlike with some other projects, there is no
> perpetual source of ambitious researchers to contribute. And the original
> founders long since posted their last significant contribution.
>
> On Wed, Oct 22, 2014 at 9:20 AM, Mahesh Balija <balijamahesh.mca@gmail.com>
> wrote:
>
> > Hi Team,
> >
> > Thanks for your replies. Even if you consider the strong implementation of
> > recommendations and SVD in Mahout, I would still say that even Spark
> > 1.1.0 supports collaborative filtering (alternating least
> > squares (ALS)) and, under dimensionality reduction, SVD and PCA. With its
> > fast pace of contributions, I believe Spark may NOT be far from having new
> > and stable algorithms added to it (like ANN, HMM, etc., and support for
> > scientific libraries).
> >
> > Ted, even though the Mahout (1.0) development code base supports Scala and
> > Spark bindings externally, Spark has built-in support for Scala (as it has
> > been developed in Scala). And NumPy is a Python-based scientific library
> > which needs to be used to support the Python-based MLlib in Spark. The
> > benefit is that Python users are also supported in Spark.
> >
> > A major unique feature of Mahout is that, as Mahout is inherited from
> > Lucene, it has built-in support for text processing. Of course, I do NOT
> > believe it's a strong point, as I assume that developers who know Lucene
> > can easily use it with Spark through the Java interface.
> >
> > Mahout has currently stopped support for Hadoop (i.e., for further
> > libraries); on the other hand, Spark can easily re-use the data present in
> > Hadoop/HBase (maybe NOT the MapReduce functionality, as Spark has its own
> > computation layer).
> >
> > *As a long-time user of Mahout I strongly support Mahout (despite its
> > poor visualization capabilities). At the same time, I am trying to
> > understand: if Spark's MLlib package continues to evolve, with
> > support for in-memory computation, rich scientific libraries
> > through Scala, and support for languages like Java/Scala/Python, will the
> > survival of Mahout be questionable?*
> >
> > Best!
> > Mahesh Balija.
> >
> >
> >
> > On Wed, Oct 22, 2014 at 1:26 PM, Martin, Nick <NiMartin@pssd.com> wrote:
> >
> > > I know we lost the maintainer for fpgrowth somewhere along the line but
> > > it's definitely something I'd love to see carried forward, too.
> > >
> > > Sent from my iPhone
> > >
> > > > > On Oct 22, 2014, at 8:09 AM, "Brian Dolan" <buddha314@gmail.com> wrote:
> > > > >
> > > > > Sing it, brother!  I miss FP Growth as well.  Once the Scala bindings
> > > > > are in, I'm hoping to work up some time series methods.
> > > >
> > > >> On Oct 21, 2014, at 8:00 PM, Lee S <sleefd@gmail.com> wrote:
> > > >>
> > > > >> As a developer who is facing the choice of library between Mahout and
> > > > >> MLlib, I have some thoughts below.
> > > > >> Mahout has no decision tree algorithm, but MLlib has the components
> > > > >> for constructing a decision tree algorithm, such as Gini index and
> > > > >> information gain. Also, I think Mahout could add algorithms for frequent
> > > > >> pattern mining, which is very important in feature selection and
> > > > >> statistical analysis. MLlib has no frequent pattern mining algorithms.
> > > > >> P.S. Why was the fpgrowth algorithm removed in version 0.9?
> > > >>
> > > >> 2014-10-22 9:12 GMT+08:00 Vibhanshu Prasad <
> vibhanshugsoc2@gmail.com
> > >:
> > > >>
> > > > >>> Actually, Spark is available in Python also, so users of Spark have an
> > > > >>> upper hand over traditional users of Mahout. This applies to
> > > > >>> all the Python libraries (including NumPy).
> > > >>>
> > > > >>> On Wed, Oct 22, 2014 at 3:54 AM, Ted Dunning <ted.dunning@gmail.com>
> > > > >>> wrote:
> > > >>>
> > > > >>>> On Tue, Oct 21, 2014 at 3:04 PM, Mahesh Balija <balijamahesh.mca@gmail.com>
> > > > >>>> wrote:
> > > >>>>
> > > > >>>>> I am trying to differentiate between Mahout and Spark, here is the
> > > > >>>>> small list,
> > > > >>>>>
> > > > >>>>> Features                  Mahout  Spark
> > > > >>>>> Clustering                Y       Y
> > > > >>>>> Classification            Y       Y
> > > > >>>>> Regression                Y       Y
> > > > >>>>> Dimensionality Reduction  Y       Y
> > > > >>>>> Java                      Y       Y
> > > > >>>>> Scala                     N       Y
> > > > >>>>> Python                    N       Y
> > > > >>>>> Numpy                     N       Y
> > > > >>>>> Hadoop                    Y       Y
> > > > >>>>> Text Mining               Y       N
> > > > >>>>> Scala/Spark Bindings      Y       N/A
> > > > >>>>> scalability               Y       Y
> > > >>>>
> > > > >>>> Mahout doesn't actually have strong features for clustering,
> > > > >>>> classification and regression. Mahout is very strong in
> > > > >>>> recommendations (which you don't mention) and dimensionality reduction.
> > > >>>>
> > > >>>> Mahout does support scala in the development version.
> > > >>>>
> > > >>>> What do you mean by support for Numpy?
> > > >
> > >
> >
>
