OK, we've been on the same page then up to the point where you say one
split stores multiple columns/rows, not just one. In that case, if there
are N splits on each datanode, there will still be N (big, but probably
sparse) matrices transferred over the network because only the emits of one
split can be combined. Also, there is some upper bound of the size of a
matrix by the HDFS block size and the number of nonzero elements per
column/row right? So the bigger the matrix, the less the gain by the
combiner because fewer stripes fit into one split. Is this no problem in
practice? And about my particular application problem, is there any way to
combine the outputs of multiple map tasks, not just the results of a single
map task?
2012/9/26 Sebastian Schelter <ssc@apache.org>
> Hi Sigurd,
>
> I think that's the misconception then: "each stripe (column/row) is
> stored in a single file".
>
> Each split contains (IntWritable, VectorWritable)tuples, for the first
> matrix, these represent the columns, for the second, these represent the
> rows.
>
> In order to compute the outer products, these two inputs are joined via
> a mapside join conducted by Hadoop's composite input format. This is a
> very effective way, because you can exploit data locality. If you have
> two matching input splits on the same machine, there is no network
> traffic involved in joining them.
>
> Note that this approach only works if both inputs are partitioned and
> sorted in the same way.
>
> sebastian
>
