hadoop-mapreduce-dev mailing list archives

From Alan Gates <ga...@yahoo-inc.com>
Subject Re: Map-Balance-Reduce draft
Date Tue, 09 Feb 2010 01:22:54 GMT

Sorry if any of my questions or comments would have been answered by
the diagrams, but Apache lists don't allow attachments, so I can't see
them.

If I understand correctly, your suggestion for balancing is to apply  
reduce on subsets of the hashed data, and then run reduce again on  
this reduced data set.  Is that correct?  If so, how does this differ  
from the combiner?  Second, some aggregation operations truly aren't  
algebraic (that is, they cannot be distributed across multiple  
iterations of reduce).   An example of this is session analysis, where  
the algorithm truly needs to see all operations together to analyze  
the user session.  How do you propose to handle that case?
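
To make the algebraic versus non-algebraic distinction concrete, here is a
minimal Python sketch. It is not Hadoop code: `count_sessions`, the 30-unit
`gap`, and the sample timestamps are all assumptions made up for illustration.
A sum gives the same answer when reduced in two stages over subsets, while a
session count does not, because each subset sees only part of the user's
events.

```python
def count_sessions(timestamps, gap=30):
    """Count sessions: a new session starts when two consecutive
    events are more than `gap` time units apart (hypothetical rule)."""
    ordered = sorted(timestamps)
    if not ordered:
        return 0
    breaks = sum(1 for prev, cur in zip(ordered, ordered[1:]) if cur - prev > gap)
    return 1 + breaks

# Event timestamps for a single user (made-up data).
events = [0, 10, 20, 100, 110, 120]

# Pretend a balancing step scattered the events over two subsets.
subset_a, subset_b = events[0::2], events[1::2]

# Algebraic: reducing the subsets and then reducing the partial results
# gives the same answer as one reduce over everything.
sum_direct = sum(events)
sum_two_stage = sum([sum(subset_a), sum(subset_b)])

# Not algebraic: per-subset session counts cannot simply be added up,
# because each subset is missing the events that bridge its gaps.
sessions_direct = count_sessions(events)
sessions_two_stage = count_sessions(subset_a) + count_sessions(subset_b)

print(sum_direct == sum_two_stage)            # the sums agree
print(sessions_direct == sessions_two_stage)  # the session counts do not
```

In the second case the only safe choice is to deliver every event for the key
to a single reduce call, which is exactly the situation where splitting a
skewed key does not help.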


On Feb 7, 2010, at 11:25 PM, jian yi wrote:

> Two targets:
> 1. Solve the skew problem.
> 2. Treat a task as a timeslice to improve the scheduler, so that it
> can switch from one job to another at timeslice boundaries.
> In the MR (Map-Reduce) model, reduces are not balanced, because the
> partition sizes are unbalanced. How to balance them? We can control
> the partition size: rehash the bigger partitions and combine up to
> the specified size. If a key has many values, it is necessary to
> execute MapReduce twice. The following is the model diagram:
> The scheduler can treat a task as a timeslice, similar to an OS
> scheduler. If a split is bigger than a specified size, it will be
> split again. If a split is smaller than a specified size, it will be
> combined with others; we can call this combining procedure regroup.
> The combining is logical: it is not necessary to merge the smaller
> splits into a disk file, so it will not affect performance. The
> target is for every task to take the same time to run.
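
The split-and-regroup step described in the quoted proposal could be sketched
roughly as follows. This is only one reading of the description, not the
proposer's design: the `regroup` function, the `(split_id, offset, length)`
piece format, and the greedy packing order are all assumptions. Oversized
splits are cut into pieces of at most `target` size, and small pieces are then
grouped logically (no data is copied) into tasks of at most `target` size.

```python
def regroup(split_sizes, target):
    """Return a list of tasks; each task is a list of logical pieces
    (split_id, offset, length) whose lengths sum to at most `target`."""
    # Step 1: cut every split that is larger than `target`.
    pieces = []
    for split_id, size in enumerate(split_sizes):
        offset = 0
        while size - offset > target:
            pieces.append((split_id, offset, target))
            offset += target
        if size - offset > 0:
            pieces.append((split_id, offset, size - offset))

    # Step 2: pack pieces into tasks, largest first, starting a new task
    # whenever the next piece would push the current task over `target`.
    tasks, current, current_size = [], [], 0
    for piece in sorted(pieces, key=lambda p: p[2], reverse=True):
        if current and current_size + piece[2] > target:
            tasks.append(current)
            current, current_size = [], 0
        current.append(piece)
        current_size += piece[2]
    if current:
        tasks.append(current)
    return tasks

# Made-up split sizes in arbitrary units, with a target task size of 100.
tasks = regroup([250, 30, 40, 100], target=100)
task_sizes = [sum(length for _, _, length in task) for task in tasks]
print(task_sizes)
```

Equal task sizes only approximate equal running time, and this kind of
regrouping applies to input splits; it does not by itself address a single
reduce key with too many values.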
