tez-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Grandl <rgra...@yahoo.com.INVALID>
Subject Re: Tez runtime repartitioning
Date Mon, 02 Jan 2017 18:08:16 GMT
Thanks a lot Rajesh. This is very helpful. 

- Robert 

    On Sunday, January 1, 2017 11:18 PM, Rajesh Balamohan <rbalamohan@apache.org> wrote:

 Tez uses IFile as its intermediate file output format. This is similar to
MR, but in tez good amount of optimizations have gone in terms of densely
packing the data.

It ends up writing intermediate data for all partitions in a file and
creates an index file. Intermediate data is served out by NodeManagers via

Write codepath for ordered outputs:
Write codepath for unordered outputs:

Tez does not rewrite any partitions for auto-reduce parallelism which can
be very expensive. Instead it manages the repartitions by logical mapping
and intelligent routing in ShuffleVertexManager.

Assume the following 2 vertices connected by scatter-gather edge.

V1 (143 tasks) ----> V2 (64 tasks)

Assume after auto-reduce parallelism, the number of tasks in V2 is computed
to be 4. There is a lot which happens in the code in terms of doing the
mapping and rounting. But at a high level in this case,
{0..15} partitions from all source tasks would be sent to task 0 in V2.
{16..31} partitions from all source tasks would be sent to task 1 in V2.
{48..63} partitions from all source tasks would be sent to task 3 in V2.

It is possible to specify the minimum number of tasks that needs to be
present in auto-reduce parallelism via
"tez.shuffle-vertex-manager.min-task-parallelism". User can configure it to
say "3" (just example) as a safety net. So at runtime, number of tasks in
V2 would not go below 3.


On Sat, Dec 31, 2016 at 1:40 AM, Robert Grandl <rgrandl@yahoo.com.invalid>

> Hi guys,
> I know that Tez is able to automatically tune the number of downstream
> tasks based on statistics regarding the output of the tasks from the
> upstream tasks. Upstream tasks can be map tasks, downstream can be reduce
> tasks in MR parlance.
> It seems that Tez is somehow able to repartition the key ranges to adjust
> the new number of downstream tasks computed at runtime based on these
> statistics.
> I have the following questions:1. When the upstream tasks are executing,
> how is the output data partitioned? I assume it should assume certain
> key-range splitting into separate partitions which are written to disk into
> intermediate files.
> 2. After certain upstream tasks have finished and the number of
> downstream tasks are adjusted based on the expected output, then the data
> will basically be repartitioned. That means if initially my data was going
> into 10 partitions, now it may go to 2 because Tez decided that only 2
> downstream tasks are enough to fetch the data. If this is the case, how the
> repartitioning happens? What key ranges from the initial partitions will go
> to what key ranges from the computed new partitions?
> Thanks in advance,Robert

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message