From Sigurd Spieckermann <sigurd.spieckerm...@gmail.com>
Subject Re: Combiner applied on multiple map task outputs (like in Mahout SVD)
Date Wed, 26 Sep 2012 13:38:27 GMT
```Yes, but one int/vector pair corresponds to the respective column of A
multiplied by an element of the respective row of B, correct? So the
concatenation of the resulting columns would be the outer product of the column
of A and the row of B. None of these vectors are summed with one another; rather,
the outer products from multiple map tasks are summed. So what is the job of
the combiner here? It would be nice if the combiner could sum up all outer
products computed on that datanode, but this is the part I can't see
happening in Hadoop. Is the general statement correct that a combiner is
only applied to the outputs of a single *map task* and that a map task processes
all key-value pairs of a split? In this case, there is only one key-value
pair per split, right? The int/vector pair consists of the index and the
corresponding column/row of the matrix.

2012/9/26 Jake Mannix <jake.mannix@gmail.com>

> On Wed, Sep 26, 2012 at 4:49 AM, Sigurd Spieckermann <
> sigurd.spieckermann@gmail.com> wrote:
>
> > Hi guys,
> >
> > I'm trying to understand the way the combiner in Mahout SVD works
> > (https://cwiki.apache.org/MAHOUT/dimensional-reduction.html). As far as I
> > know from the Mahout math matrix-multiplication implementation, matrix A
> > is represented by column-vectors, matrix B is represented by row vectors
> > and an inner join executes an outer product of the columns of A with the
> > rows of B. All outer products are summed by the combiners and reducers.
> > What I am wondering about is how a combiner can actually combine multiple
> > outer products on the same datanode because the join-package requires the
> > data to be partitioned into unsplittable files. In this case, I understand
> > that one file contains one column/row of its corresponding matrix. Hence,
> > each map task receives a column-row-tuple, computes the outer product and
> > emits the result.
>
>
> This all sounds right, but not the following:
>
>
> > My understanding of Hadoop is that the combiner follows a map task
> > immediately but one map task produces only a single result so there is
> > nothing to combine.
>
>
> That part is not true - a mapper may emit more than one key-value pair (and
> for matrix multiplication, this is true *a fortiori* - there is one
> int/vector pair emitted per nonzero element of the row being mapped over).
>
>
> > If the combiner could accumulate the results of
> > multiple map tasks, I would understand the idea, but from my understanding
> > and tests, it does not.
> >
> > Could anyone clarify the process please?
> >
> > Thanks a lot!
> > Sigurd
> >
>
>
>
> --
>
>   -jake
>

```
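
The exchange above can be illustrated with a small simulation. This is only a sketch in plain Python, not Mahout or Hadoop code: the names `map_pair`, `sum_by_key`, `run_map_task` and the sample `splits` are hypothetical, and it assumes a combiner that sees exactly the output of one map task. Each joined pair (column of A, row of B) emits one int/vector pair per nonzero element, keyed by output row index, and vectors sharing a key are summed.

```python
def map_pair(a_col, b_row):
    """Emit one (row index, scaled vector) pair per nonzero element of a_col."""
    for i, a in enumerate(a_col):
        if a != 0:
            yield i, [a * b for b in b_row]


def sum_by_key(pairs):
    """Sum all vectors that share a key; serves as both combiner and reducer."""
    sums = {}
    for key, vec in pairs:
        if key in sums:
            sums[key] = [x + y for x, y in zip(sums[key], vec)]
        else:
            sums[key] = list(vec)
    return sums


def run_map_task(split):
    """Map every joined pair in one split, then combine that task's output only."""
    emitted = [kv for a_col, b_row in split for kv in map_pair(a_col, b_row)]
    combined = sum_by_key(emitted)
    return emitted, combined


# Hypothetical input: A = [[1, 2, 0], [0, 3, 4]] stored by columns,
# B = [[1, 2], [3, 4], [5, 6]] stored by rows. The first split holds two
# joined pairs, the second split holds one.
splits = [
    [([1, 0], [1, 2]), ([2, 3], [3, 4])],
    [([0, 4], [5, 6])],
]

task_outputs = [run_map_task(split) for split in splits]

# The reducer sums the combined outputs of all map tasks.
shuffled = [kv for _, combined in task_outputs for kv in combined.items()]
result = sum_by_key(shuffled)

for n, (emitted, combined) in enumerate(task_outputs):
    print(f"map task {n}: {len(emitted)} pairs emitted, {len(combined)} after combining")
print(result)
```

In this model the combiner only reduces the output of the first map task, because that split contains two joined pairs that both emit key 0. A split holding a single pair emits each key at most once, so its combiner has nothing to merge, and the summation across map tasks is left to the reducer.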
