mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sebastian Schelter <...@apache.org>
Subject Re: Combiner applied on multiple map task outputs (like in Mahout SVD)
Date Wed, 26 Sep 2012 14:38:43 GMT
I think this comes down to scalability vs performance.

If you were to combine the outputs of several map tasks (that run in
different VMs), you would have to make these tasks dependent on each
other, which might cause lots of problems: straggling tasks will slow
down all the other tasks on the machine, you might have to restart all
tasks if one fails, etc.

--sebastian


On 26.09.2012 16:22, Sigurd Spieckermann wrote:
> OK, we've been on the same page then up to the point where you say one
> split stores multiple columns/rows, not just one. In that case, if there
> are N splits on each datanode, there will still be N (big, but probably
> sparse) matrices transferred over the network because only the emits of one
> split can be combined. Also, there is some upper bound of the size of a
> matrix by the HDFS block size and the number of nonzero elements per
> column/row right? So the bigger the matrix, the less the gain by the
> combiner because fewer stripes fit into one split. Is this no problem in
> practice? And about my particular application problem, is there any way to
> combine the outputs of multiple map tasks, not just the results of a single
> map task?
> 
> 2012/9/26 Sebastian Schelter <ssc@apache.org>
> 
>> Hi Sigurd,
>>
>> I think that's the misconception then: "each stripe (column/row) is
>> stored in a single file".
>>
>> Each split contains (IntWritable, VectorWritable)-tuples, for the first
>> matrix, these represent the columns, for the second, these represent the
>> rows.
>>
>> In order to compute the outer products, these two inputs are joined via
>> a map-side join conducted by Hadoop's composite input format. This is a
>> very effective way, because you can exploit data locality. If you have
>> two matching input splits on the same machine, there is no network
>> traffic involved in joining them.
>>
>> Note that this approach only works if both inputs are partitioned and
>> sorted in the same way.
>>
>> --sebastian
>>
> 


Mime
View raw message