mahout-user mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: Combiner applied on multiple map task outputs (like in Mahout SVD)
Date Wed, 26 Sep 2012 14:06:37 GMT
Hi Sigurd,

I think that's the misconception then: "each stripe (column/row) is
stored in a single file".

Each split contains (IntWritable, VectorWritable) tuples. For the first
matrix these represent the columns; for the second they represent the
rows.

To compute the outer products, the two inputs are joined with a map-side
join performed by Hadoop's CompositeInputFormat. This is very efficient
because it exploits data locality: if two matching input splits are on
the same machine, joining them involves no network traffic.

Note that this approach only works if both inputs are partitioned and
sorted in the same way.
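
To make the idea concrete, here is a minimal plain-Java sketch of what
such a join does to one pair of matching splits. It is not Mahout's or
Hadoop's code: the Entry class and the joinAndMultiply method are
hypothetical stand-ins, and summing the outer products in place is a
simplification (in a real job the partial products would be emitted and
added up later). It assumes both lists are sorted ascending by index,
which is exactly the precondition mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MapSideJoinSketch {

    /** Hypothetical stand-in for an (IntWritable, VectorWritable) tuple. */
    static final class Entry {
        final int index;
        final double[] vector;

        Entry(int index, double[] vector) {
            this.index = index;
            this.vector = vector;
        }
    }

    /**
     * Merge-joins two splits that are both sorted ascending by index and
     * sums the outer products of the matching vectors. Entries without a
     * partner in the other split are skipped.
     */
    static double[][] joinAndMultiply(List<Entry> columnsOfA, List<Entry> rowsOfB,
                                      int numRows, int numCols) {
        double[][] product = new double[numRows][numCols];
        int i = 0;
        int j = 0;
        while (i < columnsOfA.size() && j < rowsOfB.size()) {
            Entry a = columnsOfA.get(i);
            Entry b = rowsOfB.get(j);
            if (a.index < b.index) {
                i++; // no matching row in B yet, advance A
            } else if (a.index > b.index) {
                j++; // no matching column in A yet, advance B
            } else {
                // Indices match: add the outer product (column of A) x (row of B).
                for (int r = 0; r < numRows; r++) {
                    for (int c = 0; c < numCols; c++) {
                        product[r][c] += a.vector[r] * b.vector[c];
                    }
                }
                i++;
                j++;
            }
        }
        return product;
    }

    public static void main(String[] args) {
        // A = [[1, 2], [3, 4]] stored as columns, B = [[5, 6], [7, 8]] stored as rows.
        List<Entry> columnsOfA = new ArrayList<>();
        columnsOfA.add(new Entry(0, new double[] {1, 3}));
        columnsOfA.add(new Entry(1, new double[] {2, 4}));

        List<Entry> rowsOfB = new ArrayList<>();
        rowsOfB.add(new Entry(0, new double[] {5, 6}));
        rowsOfB.add(new Entry(1, new double[] {7, 8}));

        double[][] product = joinAndMultiply(columnsOfA, rowsOfB, 2, 2);
        System.out.println(Arrays.deepToString(product)); // [[19.0, 22.0], [43.0, 50.0]]
    }
}
```

The single forward pass over both lists is why the sort order matters:
if one input were ordered differently, matching indices would be passed
over and never joined.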

--sebastian
