mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Samsara's learning curve
Date Wed, 29 Mar 2017 16:26:03 GMT
While I agree with D and T, I’ll add a few things to watch out for.

One of the hardest things to learn is the new model of execution: it's not quite Spark or
any other compute engine. You create contexts that virtualize the actual compute
engine, but you will probably need to use the actual compute engine too. Switching back and
forth is fairly simple, but it must be learned and could be documented better.
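
A rough sketch of what that back-and-forth looks like on Spark. This is illustrative only, not something to compile as-is: it assumes Mahout's Spark bindings on the classpath, and the names (`mahoutSparkContext`, `drmParallelize`, `drmWrap`, the DRM `rdd` accessor) are from the public Samsara API as I understand it, so treat the details as approximate:

```
// Sketch only -- assumes Mahout's Spark bindings; API details approximate.
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

// The "virtualized" context: Mahout wraps the underlying engine.
implicit val mc = mahoutSparkContext(masterUrl = "local[2]", appName = "ctx-demo")

val drmA = drmParallelize(dense((1, 2), (3, 4)), numPartitions = 2)

// Engine-neutral Samsara algebra:
val drmAtA = drmA.t %*% drmA

// Dropping down to the actual engine (here, a Spark RDD) when needed:
val rdd = drmAtA.checkpoint().rdd
// ... arbitrary Spark work on `rdd` ...
val drmBack = drmWrap(rdd)   // and back into Mahout's virtualized world
```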

The other missing bit is dataframes. R and Spark have them in different forms, but Mahout largely
ignores the issue of real-world object ids. Again, not very hard to work around, and here's
hoping it's added in a future rev.
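
The usual workaround for the missing object ids is a small bidirectional dictionary: map your external ids (user names, SKUs, ...) to the contiguous integer row keys the matrices expect, and keep the reverse map to translate results back. A minimal plain-Scala sketch (the `IdDictionary` name and the sample ids are hypothetical, not from Mahout):

```scala
// Hypothetical sketch of the common id-dictionary workaround.
object IdDictionary {
  // forward map: external id -> contiguous row index
  def index(ids: Seq[String]): Map[String, Int] = ids.zipWithIndex.toMap

  // reverse map: row index -> external id, for translating results back
  def reverse(toIndex: Map[String, Int]): Map[Int, String] = toIndex.map(_.swap)

  def main(args: Array[String]): Unit = {
    val toIndex = index(Seq("user-42", "user-7", "user-99"))
    val fromIndex = reverse(toIndex)
    println(toIndex("user-7"))   // 1 -- the matrix row key for "user-7"
    println(fromIndex(2))        // user-99 -- a result row translated back
  }
}
```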


On Mar 27, 2017, at 1:38 PM, Trevor Grant <trevor.d.grant@gmail.com> wrote:

I tend to agree with D.

For example, I set out to do the 'Eigenfaces problem' last year and wrote
a blog on it.  It ended up being about 4 lines of Samsara code (+ imports);
the "hardest" part was loading images into vectors, and then vectors back
into images (wasn't awful, but I was new to Scala).  Beyond the modest
marketing and the lack of introductory tutorials, the issue is that to really
use Mahout-Samsara in the first place you need a fairly good grasp
of linear algebra, which gives it significantly less mass appeal than, say,
mllib/sklearn/etc. Your
I-just-got-my-data-science-certificate-from-coursera data scientists simply
aren't equipped to use Mahout.  Your advanced-R-type data scientists can
use it, but unless they have a problem that is too big for a single machine,
they have no motivation to use it (this may change with native solvers, more
algorithms, etc.), and even given motivation the question then becomes: learn
Mahout, or come up with a clever trick for being able to stay on a single
machine.
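
For a concrete sense of the "hardest" part Trevor mentions (images to vectors and back), here's a minimal plain-Scala version using only the JDK, grayscale and row-major. This is a hypothetical sketch, not code from his blog post:

```scala
import java.awt.image.BufferedImage

// Hypothetical sketch: flatten an image into a vector and reconstruct it.
object ImageVectors {
  // row-major flatten; take the low byte of each ARGB pixel as a gray level
  def imageToVector(img: BufferedImage): Array[Double] = {
    val w = img.getWidth
    Array.tabulate(w * img.getHeight) { i =>
      (img.getRGB(i % w, i / w) & 0xff).toDouble
    }
  }

  // inverse: write each gray level back into all three RGB channels
  def vectorToImage(v: Array[Double], w: Int, h: Int): BufferedImage = {
    val img = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB)
    for (i <- v.indices) {
      val g = math.max(0, math.min(255, v(i).toInt))
      img.setRGB(i % w, i / w, (g << 16) | (g << 8) | g)
    }
    img
  }

  def main(args: Array[String]): Unit = {
    val img = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB)
    img.setRGB(0, 0, 0xffffff)                       // one white pixel
    val v = imageToVector(img)
    println(v.mkString(" "))                         // 255.0 0.0 0.0 0.0
    val back = imageToVector(vectorToImage(v, 2, 2))
    println(v.sameElements(back))                    // true: lossless round trip
  }
}
```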

But yeah, it's a fairly easy and pleasant framework.  If you have the proper
motivation, there is simply nothing else like it.

tg

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Mon, Mar 27, 2017 at 12:32 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
wrote:

> I believe writing in the DSL is simple enough, especially if you have some
> familiarity with Scala on top of R (or, in my case, R on top of Scala
> perhaps :). I've implemented about a couple dozen customized algorithms that
> used distributed Samsara algebra at least to some degree, and I think I can
> reliably attest that none of them ever exceeded 100 lines or so, and that it
> significantly reduced the time I dedicated to writing algebra on top of Spark
> and some other backends I use in proprietary settings. I am now mostly
> doing non-algebraic improvements, because writing algebra is easy.
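
To make the brevity concrete: a distributed Gramian in Samsara is the one-liner `val drmAtA = drmA.t %*% drmA`. Here is the same computation written out in plain Scala (no Mahout dependency, tiny in-memory matrix) just to show what that line computes; the DSL does the same thing, distributed. The `Gramian` object is illustrative, not Mahout code:

```scala
// Plain-Scala illustration of what `drmA.t %*% drmA` computes.
object Gramian {
  // gramian(A) = A^T * A: entry (i, j) is the dot product of columns i and j
  def gramian(a: Array[Array[Double]]): Array[Array[Double]] = {
    val m = a.length
    val n = a(0).length
    Array.tabulate(n, n) { (i, j) =>
      (0 until m).map(k => a(k)(i) * a(k)(j)).sum
    }
  }

  def main(args: Array[String]): Unit = {
    val a = Array(Array(1.0, 2.0), Array(3.0, 4.0))
    println(gramian(a).map(_.mkString(" ")).mkString("; "))  // 10.0 14.0; 14.0 20.0
  }
}
```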
> 
> The most difficult part, however, at least for me, and as you will see as you
> go along with the book, was not the peculiarities of the R-like bindings but
> the algorithm reformulations. Traditional "in-memory" algorithms do not
> work on shared-nothing backends: even though you could program them, they
> simply will not perform.
> 
> The main reasons some of the traditional algorithms do not work at scale
> are that they either require random memory access or (more often) are
> simply super-linear w.r.t. input size, so as one scales infrastructure at
> linear cost, one still incurs a less-than-expected increment in
> performance (if any at all, at some point) per unit of input.
> 
> Hence, some mathematically, or should I say statistically,
> motivated tricks are usually still required. As the book describes, linearly or
> sub-linearly scalable sketches, random projections, dimensionality
> reductions, etc. are required to alleviate the scalability issues of
> super-linear algorithms.
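
As one example of such a trick, here is a minimal plain-Scala sketch of random projection: multiplying n-dimensional rows by a random n x k Gaussian matrix scaled by 1/sqrt(k) approximately preserves pairwise distances while cutting dimensionality at linear cost. The `RandomProjection` object and its parameters are hypothetical, purely for illustration:

```scala
import scala.util.Random

// Hypothetical sketch of random projection for dimensionality reduction.
object RandomProjection {
  // project m rows of dimension n down to dimension k via a random
  // Gaussian matrix R (n x k) with entries drawn from N(0, 1/k)
  def project(rows: Array[Array[Double]], k: Int, seed: Long): Array[Array[Double]] = {
    val n = rows(0).length
    val rnd = new Random(seed)
    val r = Array.fill(n, k)(rnd.nextGaussian() / math.sqrt(k))
    rows.map { row =>
      Array.tabulate(k)(j => (0 until n).map(i => row(i) * r(i)(j)).sum)
    }
  }

  def main(args: Array[String]): Unit = {
    val rnd = new Random(1)
    val data = Array.fill(20, 100)(rnd.nextGaussian())     // 20 rows, 100 dims
    val proj = project(data, k = 10, seed = 42L)           // down to 10 dims
    def dist(a: Array[Double], b: Array[Double]) =
      math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
    println(s"${proj.length} x ${proj(0).length}")         // 20 x 10
    // pairwise distance is approximately preserved (ratio near 1)
    println(f"distance ratio: ${dist(proj(0), proj(1)) / dist(data(0), data(1))}%.2f")
  }
}
```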
> 
> To your question: I've had a couple of people do some pieces on various
> projects with Samsara before, but they had me as a coworker. I am
> personally not aware of any outside developers beyond the people already on the
> project @ Apache and my co-workers, although in all honesty I feel it has
> more to do with the maturity and modest marketing of the public version of
> Samsara than necessarily the difficulty of adoption.
> 
> -d
> 
> 
> 
> On Sun, Mar 26, 2017 at 9:15 AM, Gustavo Frederico <
> gustavo.frederico@thinkwrap.com> wrote:
> 
>> I read Lyubimov's and Palumbo's book on Mahout Samsara up to chapter 4
>> (Distributed Algebra). I have some familiarity with R, and I studied
>> linear algebra and calculus in undergrad. In my master's I studied
>> statistical pattern recognition and researched a number of ML
>> algorithms in my thesis, spending more time on SVMs. This is to ask:
>> what is the learning curve of Samsara? How complicated is it to work with
>> distributed algebra to create an algorithm? Can someone share an
>> example of how long she/he took to go from algorithm conception to
>> implementation?
>> 
>> Thanks
>> 
>> Gustavo
>> 
> 

