spark-dev mailing list archives

From: Dmitriy Lyubimov <dlie...@gmail.com>
Subject: Re: Machine Learning on Spark [long rambling discussion email]
Date: Thu, 25 Jul 2013 18:16:48 GMT
On Wed, Jul 24, 2013 at 11:47 AM, Reynold Xin <rxin@cs.berkeley.edu> wrote:

> On Wed, Jul 24, 2013 at 1:46 AM, Nick Pentreath
> <nick.pentreath@gmail.com> wrote:
>
> >
> > I also found Breeze to be very nice to work with and like the DSL - hence
> > my question about why not use that? (Especially now that Breeze is
> > actually just breeze-math and breeze-viz).
> >
>
>
> Matei addressed this from a higher level. I want to provide a little bit
> more context. A common property of a lot of high-level Scala DSL
> libraries is that simple operators tend to have high virtual-function
> overhead and also create a lot of temporary objects. And because the
> level of abstraction is so high, it is fairly hard to debug / optimize
> performance.
>

I was *kinda* worried about it too.

But as it often happens, it seems to me we are worrying about something
that will never compare with the cost of the bulk computation.

Consider this fragment (it is from one of the flavors of weighted ALS with
weighted regularization):

    // builds the normal-equation matrix that then goes to the Cholesky solver
    val cholArg = icVtV + (vBlockForC.t %*%: diagv(d)) %*% vBlockForC +
        diag(n_u * lambda, k)

Yes, we just created a few object references here for the GC via Scala
implicit conversions, while millions of flops were meanwhile being sent to
the FPU behind the scenes. AND we got an optimized left-multiply (the %*%:
operator) with a diagonal matrix, as well as symmetric-matrix
optimizations, in a very Scala way. And it reads just like what R folks
would understand. IMO the benefits of the DSL clearly outweigh whatever
claimed overhead exists. I can't deny I find it subjectively more elegant
than

    vblock.transpose().times(....new DiagonalMatrix(n_u * lambda, k)...)
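
To make that fragment self-contained, here is a quick sketch in the spirit
of the scalabindings DSL, with made-up dimensions and values just to show
the shapes involved (icVtV, vBlockForC, d, n_u, lambda, k are all invented
stand-ins; only the last expression mirrors the fragment above):

    import org.apache.mahout.math.scalabindings._
    import RLikeOps._

    val k = 2                 // factorization rank
    val lambda = 0.05         // regularization weight
    val n_u = 3               // interactions observed for user u

    // rows of V for the items user u touched (n_u x k)
    val vBlockForC = dense((0.1, 0.2), (0.3, 0.4), (0.5, 0.6))

    // per-interaction confidence weights, applied as a diagonal matrix
    val d = dvec(2.0, 5.0, 1.0)

    // precomputed V'V term (k x k); just a stand-in here
    val icVtV = dense((1.0, 0.1), (0.1, 1.0))

    // same expression as the fragment: optimized diagonal left-multiply
    // plus the weighted ridge term n_u * lambda * I
    val cholArg = icVtV + (vBlockForC.t %*%: diagv(d)) %*% vBlockForC +
        diag(n_u * lambda, k)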


As far as Mahout abstraction quality goes (and I am assuming we are talking
about in-core linear algebra support here, because there is much more
besides that), it is debatable, but that is exactly why I started doing the
DSL in the first place. The DSL should iron a lot of that out, as we have
seen, and bring it closer to an R/Matlab look-and-feel.
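
To give a flavor of that look-and-feel, a quick sketch (made-up values;
the exact operator set may vary a bit between branches):

    import org.apache.mahout.math.scalabindings._
    import RLikeOps._

    val a = dense((1, 2), (3, 4))   // like R's matrix(1:4, byrow = TRUE)
    val b = a.t %*% a               // matrix product, R's t(a) %*% a
    val c = a * 2.0                 // elementwise scaling, R-style
    val row = a(0, ::)              // first row as a vector view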

But there are other important factors in favor of Mahout's in-core support.

I did my honest homework for my project trying to pick an in-core linear
algebra library, and I was not stuck on Mahout's in-core support at all; I
actually really wanted to find something a bit more mature for in-core
algebra.

In my search, I failed to find a project that addresses the following two
major problems of in-core BLAS:

1) Naturally embedded support for sparse matrices, with optimizations aimed
at the degenerate nature of zero elements. No other project quite does this
to the same degree: apache-math sparse matrices are deprecated and said to
be broken; JBLAS/LAPACK doesn't have degenerate-element optimizations at
all; Breeze lacks consistency in its Matrix abstraction between sparse and
dense matrices; etc., etc.

2) As an extension of (1), a wide range of matrix types optimized for
various specific solver computations: diagonal, upper/lower triangular,
parsimoniously stored symmetric, pivoted, row-wise vs. column-wise vs.
open-addressed sparse matrices, and so on, especially with the latest
effort there. Nobody came close to that variety and ease of
sparse-operation optimizations in my (however brief) search; a couple of
these types are sketched just below.
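
For instance (a quick sketch; these are org.apache.mahout.math classes,
with constructor arguments as I recall them):

    import org.apache.mahout.math._

    val rowWise = new SparseRowMatrix(1000, 1000)   // sequentially accessed sparse rows
    val hashed  = new SparseMatrix(1000, 1000)      // open-addressed sparse rows
    val diagMx  = new DiagonalMatrix(0.5, 1000)     // 0.5 * I of order 1000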

It is kinda raw at times, but nothing that I can't handle.

But I totally agree that any such environment is not part of Spark itself.
It makes some pragmatic tasks very addressable, though, and I can see a
roadmap where I could mix Mahout's distributed solvers with Spark's freely,
until I have a chance to port/create more of what I need on the Spark side,
without any additional format/conversion issues.

> --
> Reynold Xin, AMPLab, UC Berkeley
> http://rxin.org
