mahout-user mailing list archives

From Sigurd Spieckermann <sigurd.spieckerm...@gmail.com>
Subject Re: Combiner applied on multiple map task outputs (like in Mahout SVD)
Date Wed, 26 Sep 2012 14:50:36 GMT
I see. The description of the SVD implementation made me wonder whether I had
a wrong understanding of Hadoop or of Mahout, and whether I could potentially
tune my code. When I was thinking about how to combine across multiple map
tasks (given that this was/is not configurable in Hadoop), I considered
reusing the JVM and delaying the emits until a few (or all) map tasks had
completed, but in terms of implementation I had difficulty ensuring that I
emit at least at the end of the very last map task on the datanode. It would
have ended up as a hack: storing data in a singleton class or a static
variable and emitting it at some later point. But I see your point about the
trade-off I would be making: possibly more performance, but potentially not,
e.g. if a datanode fails after some map tasks have finished but others
haven't.
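For what it's worth, the closest safe alternative within a single task is
what is sometimes called "in-mapper combining": partial results are buffered
in the task's own memory and flushed once when that task finishes, so no
cross-task singleton is needed. A minimal sketch in plain Java (no Hadoop
classes; the emit callback and the word-count-style payload are purely
illustrative):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Sketch of "in-mapper combining": partial sums are accumulated in the
// task's own memory and emitted once at the end (the analogue of Hadoop's
// Mapper#cleanup), instead of relying on cross-task static state.
// The emit callback stands in for context.write().
public class InMapperCombiner {
    private final Map<String, Long> buffer = new HashMap<>();

    // Called once per input record (analogous to Mapper#map).
    public void map(String key, long value) {
        buffer.merge(key, value, Long::sum);
    }

    // Called once when the task finishes (analogous to Mapper#cleanup):
    // the single guaranteed flush point within one task.
    public void cleanup(BiConsumer<String, Long> emit) {
        buffer.forEach(emit);
        buffer.clear();
    }

    public static void main(String[] args) {
        InMapperCombiner m = new InMapperCombiner();
        m.map("a", 1);
        m.map("b", 2);
        m.map("a", 3);
        Map<String, Long> out = new HashMap<>();
        m.cleanup(out::put);
        System.out.println(out); // prints the per-task combined sums
    }
}
```

This only combines within one task, of course, which is exactly the
limitation we were discussing.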

I guess my questions have been answered. Thank you very much, guys, for the
clarifying discussion!

2012/9/26 Sebastian Schelter <ssc@apache.org>

> I think this comes down to scalability vs performance.
>
> If you were to combine the outputs of several map tasks (that run in
> different VMs), you would have to make these tasks dependent on each
> other, which might cause lots of problems: straggling tasks will slow
> down all the other tasks on the machine, you might have to restart all
> tasks if one fails, etc.
>
> --sebastian
>
>
> On 26.09.2012 16:22, Sigurd Spieckermann wrote:
> > OK, we've been on the same page then up to the point where you say one
> > split stores multiple columns/rows, not just one. In that case, if there
> > are N splits on each datanode, there will still be N (big, but probably
> > sparse) matrices transferred over the network, because only the emits of
> > one split can be combined. Also, there is an upper bound on the size of
> > a matrix, given by the HDFS block size and the number of nonzero
> > elements per column/row, right? So the bigger the matrix, the smaller
> > the gain from the combiner, because fewer stripes fit into one split.
> > Is this no problem in practice? And regarding my particular application,
> > is there any way to combine the outputs of multiple map tasks, not just
> > the results of a single map task?
> >
> > 2012/9/26 Sebastian Schelter <ssc@apache.org>
> >
> >> Hi Sigurd,
> >>
> >> I think that's the misconception then: "each stripe (column/row) is
> >> stored in a single file".
> >>
> >> Each split contains (IntWritable, VectorWritable) tuples; for the first
> >> matrix these represent the columns, and for the second they represent
> >> the rows.
> >>
> >> In order to compute the outer products, the two inputs are joined via a
> >> map-side join using Hadoop's composite input format. This is very
> >> effective because it exploits data locality: if two matching input
> >> splits reside on the same machine, no network traffic is involved in
> >> joining them.
> >>
> >> Note that this approach only works if both inputs are partitioned and
> >> sorted in the same way.
> >>
> >> --sebastian
> >>
> >
>
>
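To make the outer-product formulation in Sebastian's mail concrete: with the
first matrix stored as columns and the second as rows, both keyed by the same
index j, each matched (column, row) pair contributes one outer product, and
the partial products sum to the full matrix product. A toy sketch in plain
Java (no Hadoop or Mahout classes; dense double[] arrays stand in for
VectorWritable, and the loop over j stands in for the map-side join):

```java
import java.util.Arrays;

// Toy version of A*B computed as a sum of outer products: column j of A
// times row j of B, summed over j. In the real job each (j, column) /
// (j, row) pair comes from two co-partitioned, identically sorted splits
// joined map-side; here we simply loop over the shared index.
public class OuterProductSum {
    static double[][] multiply(double[][] aCols, double[][] bRows) {
        int m = aCols[0].length;             // rows of the result
        int n = bRows[0].length;             // columns of the result
        double[][] c = new double[m][n];
        for (int j = 0; j < aCols.length; j++) {       // one pair per index j
            for (int r = 0; r < m; r++) {
                for (int s = 0; s < n; s++) {
                    // accumulate the partial outer product aCols[j] x bRows[j]
                    c[r][s] += aCols[j][r] * bRows[j][s];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        double[][] aCols = {{1, 3}, {2, 4}}; // columns of A = [[1,2],[3,4]]
        double[][] bRows = {{5, 6}, {7, 8}}; // rows of B    = [[5,6],[7,8]]
        System.out.println(Arrays.deepToString(multiply(aCols, bRows)));
        // prints [[19.0, 22.0], [43.0, 50.0]], i.e. A*B
    }
}
```

Since each partial outer product is just added into the result, the per-split
partial sums are exactly what a combiner can merge before the shuffle.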
