tez-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hitesh Shah <hit...@apache.org>
Subject Re: Code pointers to Understand Auto Reduce parallelism
Date Thu, 13 Aug 2015 20:09:35 GMT
When a vertex is ready to start launching tasks, it will look at available information of output
generated by upstream vertices and based on configured optimal data size per task, it will
re-configure the vertex’s parallelism. The basic tez or mapreduce shuffle implementation
is that the map stage generates say 1000 partitions if there are 1000 reducers. The default
mode would be assign partition 1 to reducer 1 and so on. If the new parallelism is different,
a new routing table is needed to map which partitions are handled by which reducers. This
is done in Tez by overriding the default Shuffle Edge Manager with a custom Edge Manager (
i.e a new routing table ).

You can look at ShuffleVertexManager.java and start from schedulePendingTasks() to see how
parallelism is determined at runtime.

— Hitesh

On Aug 13, 2015, at 11:54 AM, Saikat Roychowdhury <saikatr@yahoo-inc.com.INVALID> wrote:

> Hi, AllCan someone give some pointers to the code base to understand how auto reduce
parallelism works in Tez?
> -Saikat

View raw message