Hm, I've had the same understanding of what a map task is, but my
confusion is whether the combiner is applied only to the outputs of a
single map task (potentially many, because a split usually contains
multiple key-value pairs) or also to the outputs of multiple map
tasks. As I understand the Mahout matrix multiplication using the
Hadoop join package, each stripe (column/row) is stored in a single
file (I guess because even one column/row is assumed to be potentially
very big, taking up to the entire block size), and therefore a single
outer product is computed in *one* map task. If the combiner cannot
combine the outputs of *multiple* map tasks, then there is nothing to
combine.
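
To make the question concrete, here is a minimal Python sketch of the
scenario. It is not Mahout or Hadoop code: the names map_fn, combine and
run_map_task are hypothetical, vectors are plain lists, and it assumes
the combiner only ever sees the output of one map task.

```python
def map_fn(a_col, b_row):
    """One map invocation: emit (i, a_col[i] * b_row) per non-zero a_col[i]."""
    for i, a in enumerate(a_col):
        if a != 0:
            yield i, [a * b for b in b_row]


def combine(pairs):
    """Sum the vectors that share a key (assumed scope: one map task)."""
    sums = {}
    for key, vec in pairs:
        if key in sums:
            sums[key] = [x + y for x, y in zip(sums[key], vec)]
        else:
            sums[key] = list(vec)
    return sorted(sums.items())


def run_map_task(split):
    """A map task: call map_fn once per pair in the split, then combine."""
    emitted = [kv for a_col, b_row in split for kv in map_fn(a_col, b_row)]
    return emitted, combine(emitted)


# Split with a single column/row pair: every emitted key is distinct,
# so the combiner has nothing to merge.
emitted_one, combined_one = run_map_task([([1, 2], [3, 4])])

# Split with two column/row pairs: key 0 is emitted twice and gets summed.
emitted_two, combined_two = run_map_task([([1, 2], [3, 4]), ([5, 0], [1, 1])])
```

In this sketch the combiner only helps when one split carries more than
one column/row pair, which is exactly the case in question.
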
2012/9/26 Sebastian Schelter <ssc@apache.org>
> If I understand the discussion correctly, there is some confusion here.
> A map task is not the same as a single invocation of the map function.
>
> A map task consumes a split and invokes the map function once for each
> key-value pair contained in the split. The combine function is
> applied (usually several times, in some implementation-specific way) to
> the output of all the map invocations of that map task.
> sebastian
> On 26.09.2012 15:40, Sigurd Spieckermann wrote:
> > Well, my word selection wasn't great when I said "one map task produces
> > only a single result". What I meant was that one map task only
> > produces a single outer product (that consists of multiple column
> > vectors, hence multiple mapper emits), but those are not the ones to
> > combine in this case, right?
> >
> > 2012/9/26 Sigurd Spieckermann <sigurd.spieckermann@gmail.com>
> >
> >> Yes, but one int/vector pair corresponds to the respective column of A
> >> multiplied by an element of the respective row of B, correct? So the
> >> concatenation of the resulting columns would be the outer product of
> >> the column of A and the row of B. None of these vectors are summed up
> >> but rather the outer products of multiple map tasks are summed up. So
> >> what is the job of the combiner here? It would be nice if the combiner
> >> could sum up all outer products computed on that datanode, but this is
> >> the part I can't see happening in Hadoop. Is the general statement
> >> correct that a combiner is only applied to all outputs of a *map task*
> >> and that a map task processes all key-value pairs of a split? In this
> >> case, there is only one key-value pair per split, right? The
> >> int/vector being index and column/row of the matrix.
> >> 2012/9/26 Jake Mannix <jake.mannix@gmail.com>
> >>
> >>> On Wed, Sep 26, 2012 at 4:49 AM, Sigurd Spieckermann <
> >>> sigurd.spieckermann@gmail.com> wrote:
> >>>> Hi guys,
> >>>>
> >>>> I'm trying to understand the way the combiner in Mahout SVD works
> >>>> (https://cwiki.apache.org/MAHOUT/dimensionalreduction.html). As far
> >>>> as I know from the Mahout math matrix-multiplication implementation,
> >>>> matrix A is represented by column vectors, matrix B is represented
> >>>> by row vectors and an inner join executes an outer product of the
> >>>> columns of A with the rows of B. All outer products are summed by
> >>>> the combiners and reducers. What I am wondering about is how a
> >>>> combiner can actually combine multiple outer products on the same
> >>>> datanode because the join package requires the data to be
> >>>> partitioned into unsplittable files. In this case, I understand
> >>>> that one file contains one column/row of its corresponding matrix.
> >>>> Hence, each map task receives a column-row tuple, computes the outer
> >>>> product and emits the result.
> >>> This all sounds right, but not the following:
> >>>> My understanding of Hadoop is that the combiner follows a map task
> >>>> immediately but one map task produces only a single result so there
> >>>> is nothing to combine.
> >>> That part is not true - a mapper may emit more than one key-value
> >>> pair (and for matrix multiplication, this is true *a fortiori* -
> >>> there is one int/vector pair emitted per non-zero element of the row
> >>> being mapped over).
> >>>> If the combiner could accumulate the results of
> >>>> multiple map tasks, I would understand the idea, but from my
> >>>> understanding and tests, it does not.
> >>>>
> >>>> Could anyone clarify the process please?
> >>>>
> >>>> Thanks a lot!
> >>>> Sigurd
> >>>>
