tez-dev mailing list archives

From "Ming Ma (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TEZ-3209) Support for fair custom data routing
Date Fri, 08 Apr 2016 23:58:25 GMT
Ming Ma created TEZ-3209:

             Summary: Support for fair custom data routing
                 Key: TEZ-3209
                 URL: https://issues.apache.org/jira/browse/TEZ-3209
             Project: Apache Tez
          Issue Type: New Feature
            Reporter: Ming Ma

This is based on an offline discussion with [~gopalv], [~hitesh], [~jrottinghuis] and [~lohit]
about support for efficient processing of highly skewed, unordered, partitioned mapper
output. Our use case is to demux highly skewed, unordered category data partitioned by category
name. Gopal and Hitesh also mentioned a dynamically shuffled join scenario.

One option we discussed is to leverage the auto-parallelism feature with upfront over-partitioning.
That adds possible overhead from supporting a large number of partitions, plus unnecessary data
movement, since each reducer still needs to fetch data from all mappers.

Another alternative is to use a custom {{DataMovementType}} that doesn't require each reducer
to fetch data from all mappers. That way, a large partition is processed by several
reducers, each of which fetches data from a subset of the mappers.

For example, say there are 100 mappers, each of which has 10 partitions {P1, ..., P10}. Each
mapper generates 100MB for its P10 and 1MB for each of {P1, ..., P9}. The default SCATTER_GATHER
routing means the reducer for P10 has to process 10GB of input and becomes the bottleneck
of the job. With fair custom data routing, the P10 output of the first 10 mappers is
processed by one reducer, the P10 output of the second 10 mappers by another reducer, and
so on, so that each P10 reducer handles about 1GB.
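
The grouping in this example can be sketched in a few lines of Java. This is only an illustration of the idea, not Tez code: the class {{FairRoutingSketch}}, its {{plan}} method, and the {{Assignment}} type are hypothetical names, and the 1000MB per-reducer target is an assumption chosen to match the "10 mappers per P10 reducer" split above. The sketch walks each partition's mapper outputs in order and closes a reducer's mapper range once the accumulated size reaches the target.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names exist in Tez.
class FairRoutingSketch {

    /** One reducer task: a partition plus an inclusive, contiguous range of mappers. */
    static final class Assignment {
        final int partition;
        final int firstMapper;
        final int lastMapper;
        final long inputMb;

        Assignment(int partition, int firstMapper, int lastMapper, long inputMb) {
            this.partition = partition;
            this.firstMapper = firstMapper;
            this.lastMapper = lastMapper;
            this.inputMb = inputMb;
        }
    }

    /**
     * sizeMb[m][p] is the output size (in MB) of mapper m for partition p.
     * targetMb is the assumed per-reducer input target.
     */
    static List<Assignment> plan(long[][] sizeMb, long targetMb) {
        List<Assignment> out = new ArrayList<>();
        int mappers = sizeMb.length;
        int partitions = mappers == 0 ? 0 : sizeMb[0].length;
        for (int p = 0; p < partitions; p++) {
            int start = 0;   // first mapper of the range being built
            long acc = 0;    // bytes accumulated for the current range
            for (int m = 0; m < mappers; m++) {
                acc += sizeMb[m][p];
                boolean lastMapper = (m == mappers - 1);
                // Close the range when it is big enough, or when mappers run out.
                if (acc >= targetMb || lastMapper) {
                    out.add(new Assignment(p, start, m, acc));
                    start = m + 1;
                    acc = 0;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The example above: 100 mappers, 10 partitions, P10 is 100MB, the rest 1MB.
        long[][] sizeMb = new long[100][10];
        for (int m = 0; m < 100; m++) {
            for (int p = 0; p < 10; p++) {
                sizeMb[m][p] = (p == 9) ? 100 : 1;
            }
        }
        List<Assignment> plan = plan(sizeMb, 1000);
        System.out.println("reducers = " + plan.size()); // prints "reducers = 19"
    }
}
```

With these inputs, P1 through P9 each get a single reducer reading 100MB from all 100 mappers, while P10 is split across 10 reducers that each read 1000MB from 10 mappers. A real implementation would presumably make this decision from runtime output statistics rather than a fixed matrix.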

As a further optimization, we can schedule each reducer on the same nodes as the mappers
it fetches data from.

To support this, we need TEZ-3206 as well as a new {{VertexManagerPlugin}} implementation and
a new {{EdgeManagerPluginOnDemand}} implementation.
