tez-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bikas Saha <bi...@apache.org>
Subject Re: Tez runtime repartitioning
Date Tue, 03 Jan 2017 23:43:25 GMT
In general this works when the source vertex is over-partitioned such that
the destination can merge partitions. Thus repartitioning in not necessary.
Since merged partitions are adjacent to each other, this should work for
both hash and range partitioned cases, because relative order is preserved.

To be clear, this is effectively a user land functionality that’s happens
to be present in the Tez code base. Users can write their own coalescing
logic in a custom vertex manager and do their own thing. Hence, correctness
of this optimization is a user determined activity and needs to be enabled
with care.

Bikas

On Mon, Jan 2, 2017 at 10:08 AM, Robert Grandl <rgrandl@yahoo.com.invalid>
wrote:

> Thanks a lot Rajesh. This is very helpful.
>
> - Robert
>
>     On Sunday, January 1, 2017 11:18 PM, Rajesh Balamohan <
> rbalamohan@apache.org> wrote:
>
>
>  Tez uses IFile as its intermediate file output format. This is similar to
> MR, but in tez good amount of optimizations have gone in terms of densely
> packing the data.
>
> It ends up writing intermediate data for all partitions in a file and
> creates an index file. Intermediate data is served out by NodeManagers via
> http.
>
> Write codepath for ordered outputs:
> https://github.com/apache/tez/blob/master/tez-runtime-
> library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/
> PipelinedSorter.java#L554
> Write codepath for unordered outputs:
> https://github.com/apache/tez/blob/master/tez-runtime-
> library/src/main/java/org/apache/tez/runtime/library/common/writers/
> UnorderedPartitionedKVWriter.java#L440
>
> Tez does not rewrite any partitions for auto-reduce parallelism which can
> be very expensive. Instead it manages the repartitions by logical mapping
> and intelligent routing in ShuffleVertexManager.
>
> Assume the following 2 vertices connected by scatter-gather edge.
>
> V1 (143 tasks) ----> V2 (64 tasks)
>
> Assume after auto-reduce parallelism, the number of tasks in V2 is computed
> to be 4. There is a lot which happens in the code in terms of doing the
> mapping and rounting. But at a high level in this case,
> {0..15} partitions from all source tasks would be sent to task 0 in V2.
> {16..31} partitions from all source tasks would be sent to task 1 in V2.
> ..
> {48..63} partitions from all source tasks would be sent to task 3 in V2.
>
> It is possible to specify the minimum number of tasks that needs to be
> present in auto-reduce parallelism via
> "tez.shuffle-vertex-manager.min-task-parallelism". User can configure it
> to
> say "3" (just example) as a safety net. So at runtime, number of tasks in
> V2 would not go below 3.
>
> ~Rajesh.B
>
> On Sat, Dec 31, 2016 at 1:40 AM, Robert Grandl <rgrandl@yahoo.com.invalid>
> wrote:
>
> > Hi guys,
> > I know that Tez is able to automatically tune the number of downstream
> > tasks based on statistics regarding the output of the tasks from the
> > upstream tasks. Upstream tasks can be map tasks, downstream can be reduce
> > tasks in MR parlance.
> >
> > It seems that Tez is somehow able to repartition the key ranges to adjust
> > the new number of downstream tasks computed at runtime based on these
> > statistics.
> >
> > I have the following questions:1. When the upstream tasks are executing,
> > how is the output data partitioned? I assume it should assume certain
> > key-range splitting into separate partitions which are written to disk
> into
> > intermediate files.
> > 2. After certain upstream tasks have finished and the number of
> > downstream tasks are adjusted based on the expected output, then the data
> > will basically be repartitioned. That means if initially my data was
> going
> > into 10 partitions, now it may go to 2 because Tez decided that only 2
> > downstream tasks are enough to fetch the data. If this is the case, how
> the
> > repartitioning happens? What key ranges from the initial partitions will
> go
> > to what key ranges from the computed new partitions?
> > Thanks in advance,Robert
> >
> >
>
>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message