mahout-user mailing list archives

From Trevor Grant <trevor.d.gr...@gmail.com>
Subject Re: Samsara's learning curve
Date Mon, 27 Mar 2017 20:38:49 GMT
I tend to agree with D.

For example, I set out to do the 'Eigenfaces problem' last year, and wrote
a blog post on it.  It ended up being about 4 lines of Samsara code (+ imports);
the "hardest" part was loading images into vectors, and then vectors back
into images (wasn't awful, but I was new to Scala).  In addition to the
modest marketing and a lack of introductory tutorials, the other hurdle is
that to really use Mahout-Samsara in the first place you need a fairly good
grasp of linear algebra, which gives it significantly less mass appeal than,
say, mllib/sklearn/etc. Your
I-just-got-my-data-science-certificate-from-coursera data scientists simply
aren't equipped to use Mahout.  Your advanced-R-type data scientists can
use it, but unless they have a problem that is too big for a single machine,
they have no motivation to (this may change with native solvers, more
algorithms, etc.), and even given motivation the question then becomes: learn
Mahout OR come up with a clever trick for staying on a single
machine.

But yeah, a fairly easy and pleasant framework.  If you have the proper
motivation, there is simply nothing else like it.

tg

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Mon, Mar 27, 2017 at 12:32 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
wrote:

> I believe writing in the DSL is simple enough, especially if you have some
> familiarity with Scala on top of R (or, in my case, R on top of Scala
> perhaps:). I've implemented about a couple dozen customized algorithms that
> used distributed Samsara algebra at least to some degree, and I think I can
> reliably attest that none of them ever exceeded 100 lines or so, and that it
> significantly reduced the time I spend writing algebra on top of Spark
> and some other backends I use under proprietary settings. I am now mostly
> doing non-algebraic improvements because writing algebra is easy.
>
> The most difficult part however, at least for me, and as you can see as you
> go along with the book, was not the peculiarities of the R-like bindings, but
> the algorithm reformulations. Traditional "in-memory" algorithms do not
> work on shared-nothing backends: even though you could program them, they
> simply will not perform.
>
> The main reasons some of the traditional algorithms do not work at scale
> are that they either require random memory access, or (more often) are
> simply super-linear w.r.t. input size, so as one scales infrastructure at
> linear cost, one still gets a smaller than expected increment in
> performance (if any at all, at some point) per unit of input.
>
> Hence, usually some mathematically, or should I say, statistically
> motivated tricks are still required. As the book describes, linearly or
> sub-linearly scalable sketches, random projections, dimensionality
> reductions, etc. are required to alleviate the scalability issues of the
> super-linear algorithms.
>
> To your question, I have had a couple of people do some pieces on various
> projects with Samsara before, but they had me as a coworker. I am
> personally not aware of any outside developers beyond people already on the
> project @ Apache and my co-workers, although in all honesty I feel it has
> more to do with the maturity and modest marketing of the public version of
> Samsara than necessarily with the difficulty of adoption.
>
> -d
>
>
>
> On Sun, Mar 26, 2017 at 9:15 AM, Gustavo Frederico <
> gustavo.frederico@thinkwrap.com> wrote:
>
> > I read Lyubimov's and Palumbo's book on Mahout Samsara up to chapter 4
> > (Distributed Algebra). I have some familiarity with R, I did study
> > linear algebra and calculus in undergrad. In my master's I studied
> > statistical pattern recognition and researched a number of ML
> > algorithms in my thesis - spending more time on SVMs. This is to ask:
> > what is the learning curve of Samsara? How complicated is it to work with
> > distributed algebra to create an algorithm? Can someone share an
> > example of how long she/he took to go from algorithm conception to
> > implementation?
> >
> > Thanks
> >
> > Gustavo
> >
>
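
To put rough numbers on the super-linear point in the quoted message, here is
a minimal arithmetic sketch in Python. It assumes perfect parallelism (k
machines give exactly k times the operation budget), which is optimistic, and
the names max_input_size and base_budget are made up for illustration. Under
that assumption a linear algorithm handles k times the input on k machines,
while a quadratic one handles only sqrt(k) times the input.

```python
def max_input_size(budget_ops: float, exponent: float) -> float:
    """Largest n such that n**exponent operations fit into budget_ops."""
    return budget_ops ** (1.0 / exponent)


# Hypothetical number of operations one machine completes in the time window.
base_budget = 1e12

for machines in (1, 2, 4, 8, 16):
    # Assumption: cost is linear in machines and work parallelizes perfectly.
    budget = machines * base_budget
    linear_gain = max_input_size(budget, 1.0) / max_input_size(base_budget, 1.0)
    quadratic_gain = max_input_size(budget, 2.0) / max_input_size(base_budget, 2.0)
    print(f"{machines:2d} machines: O(n) input x{linear_gain:.2f}, "
          f"O(n^2) input x{quadratic_gain:.2f}")
```

Paying for 16 machines buys a 16x larger input for the linear algorithm but
only about 4x for the quadratic one, which is the "less than expected
increment" described in the quoted message.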

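One of the tricks named in the quoted message, random projection, can be
sketched in a few lines of plain Python. This is not Samsara code and is not
taken from the book; it is a generic Gaussian random projection, and the
function names, sizes and seeds are chosen here purely for illustration. The
idea is that multiplying d-dimensional points by a random d x k matrix (k much
smaller than d) roughly preserves pairwise distances, so later steps can work
on the smaller representation.

```python
import math
import random


def random_projection_matrix(d, k, seed=0):
    """d x k matrix with i.i.d. N(0, 1/k) entries (illustrative helper)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(d)]


def project(x, omega):
    """Map a length-d vector x to length k by computing x * omega."""
    k = len(omega[0])
    return [sum(xi * row[j] for xi, row in zip(x, omega)) for j in range(k)]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Two arbitrary points in d dimensions, reduced to k dimensions.
d, k = 500, 100
rng = random.Random(42)
x = [rng.random() for _ in range(d)]
y = [rng.random() for _ in range(d)]

omega = random_projection_matrix(d, k)
print("original distance: ", distance(x, y))
print("projected distance:", distance(project(x, omega), project(y, omega)))
```

The projection is a single pass over the data and is linear in the number of
rows, which is what makes this family of tricks attractive on shared-nothing
backends.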