mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jake Mannix <jake.man...@gmail.com>
Subject Re: Combiner applied on multiple map task outputs (like in Mahout SVD)
Date Wed, 26 Sep 2012 16:21:40 GMT
It should also be noted that the Combiner does not only run for the mappers
-
they can be used one (or more) times after mapping, and then one or more
times before the reducer gets the results.  It's not quite so simple as to
say that
you get combiners used only (and always) on the outputs of each map task.

Also, your technique of doing "mapside caching" is in fact fairly common,
but
don't worry about trying to reuse the JVM-level data, that hack will be
ugly and
not worth it: simply store local variables in your Mapper instance, and emit
the collected / aggregated data once during the close() method of the
Mapper.

We do this in e.g. our LDA implementation (see for example
o.a.m.clustering.lda.cvb.CachingCVB0Mapper ).

On Wed, Sep 26, 2012 at 7:50 AM, Sigurd Spieckermann <
sigurd.spieckermann@gmail.com> wrote:

> I see. The description of the SVD implementation made me wonder if I had a
> wrong understanding of Hadoop or of Mahout and whether I could potentially
> tune my code. When I was thinking about how to combine across multiple map
> tasks (given it was/is not configurable in Hadoop), I was considering to
> reuse the JVM and delay the emits until a few (or all) map tasks have
> completed, but in terms of implementation I encountered difficulties to
> ensure I'm at least emitting at the end of the very last map task on the
> datanode. It would have ended up in some hack storing data in a singleton
> class or static variable or something like that and emitting at some later
> point. But I see your point about the trade-off I would be making towards
> more performance (maybe) but potentially not, e.g. if a datanode fails
> after some map tasks have finished but some haven't.
>
> I guess my questions have been answered. Thank you very much guys for the
> clarifying discussion!
>
> 2012/9/26 Sebastian Schelter <ssc@apache.org>
>
> > I think this comes down to scalability vs performance.
> >
> > If you were to combine the outputs of several map tasks (that run in
> > different VMs), you would have to make these tasks dependent on each
> > other, which might cause lots of problems: straggling tasks will slow
> > down all the other tasks on the machine, you might have to restart all
> > tasks if one fails, etc.
> >
> > --sebastian
> >
> >
> > On 26.09.2012 16:22, Sigurd Spieckermann wrote:
> > > OK, we've been on the same page then up to the point where you say one
> > > split stores multiple columns/rows, not just one. In that case, if
> there
> > > are N splits on each datanode, there will still be N (big, but probably
> > > sparse) matrices transferred over the network because only the emits of
> > one
> > > split can be combined. Also, there is some upper bound of the size of a
> > > matrix by the HDFS block size and the number of nonzero elements per
> > > column/row right? So the bigger the matrix, the less the gain by the
> > > combiner because fewer stripes fit into one split. Is this no problem
> in
> > > practice? And about my particular application problem, is there any way
> > to
> > > combine the outputs of multiple map tasks, not just the results of a
> > single
> > > map task?
> > >
> > > 2012/9/26 Sebastian Schelter <ssc@apache.org>
> > >
> > >> Hi Sigurd,
> > >>
> > >> I think that's the misconception then: "each stripe (column/row) is
> > >> stored in a single file".
> > >>
> > >> Each split contains (IntWritable, VectorWritable)-tuples, for the
> first
> > >> matrix, these represent the columns, for the second, these represent
> the
> > >> rows.
> > >>
> > >> In order to compute the outer products, these two inputs are joined
> via
> > >> a map-side join conducted by Hadoop's composite input format. This is
> a
> > >> very effective way, because you can exploit data locality. If you have
> > >> two matching input splits on the same machine, there is no network
> > >> traffic involved in joining them.
> > >>
> > >> Note that this approach only works if both inputs are partitioned and
> > >> sorted in the same way.
> > >>
> > >> --sebastian
> > >>
> > >
> >
> >
>



-- 

  -jake

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message