tez-dev mailing list archives

From "Ming Ma (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TEZ-3430) Make split sorting optional
Date Wed, 07 Sep 2016 20:00:24 GMT
Ming Ma created TEZ-3430:
----------------------------

             Summary: Make split sorting optional
                 Key: TEZ-3430
                 URL: https://issues.apache.org/jira/browse/TEZ-3430
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Ming Ma


The fair routing design in TEZ-3209 addresses skewed partitions, where one partition can
be much larger than the others. To simplify stats tracking, however, it assumes that a given
partition's data is distributed roughly evenly across source tasks, so that it can group
consecutive source tasks together.

However, this assumption does not hold, because {{MRInputHelpers}}'s generateNewSplits and
generateOldSplits sort the splits by size. As a result, source tasks at the beginning of the
task range handle more data than those at the end.

{noformat}
Arrays.sort(splits, new InputSplitComparator());
{noformat}
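
To illustrate the effect, here is a minimal sketch, not Tez code. The class {{SplitSortSkewDemo}}, its methods, and the sizes are hypothetical, and it assumes that a source task's output for a partition scales with its split size. It groups consecutive source tasks four at a time and compares the group totals with and without sorting the splits largest first.

```java
import java.util.Arrays;

// Hypothetical demo class; not part of Tez.
final class SplitSortSkewDemo {

    // Sums consecutive runs of groupSize entries, mimicking a router that
    // groups consecutive source tasks together.
    static long[] groupSums(long[] sizes, int groupSize) {
        int groups = (sizes.length + groupSize - 1) / groupSize;
        long[] sums = new long[groups];
        for (int i = 0; i < sizes.length; i++) {
            sums[i / groupSize] += sizes[i];
        }
        return sums;
    }

    // Returns a copy ordered largest first, the order the issue describes.
    static long[] sortedDescending(long[] sizes) {
        long[] copy = sizes.clone();
        Arrays.sort(copy); // ascending
        for (int i = 0, j = copy.length - 1; i < j; i++, j--) {
            long tmp = copy[i];
            copy[i] = copy[j];
            copy[j] = tmp;
        }
        return copy;
    }

    public static void main(String[] args) {
        // Made-up split sizes in their original, unsorted order.
        long[] sizes = {100, 10, 90, 20, 80, 30, 70, 40};
        // Unsorted: prints [220, 220]
        System.out.println(Arrays.toString(groupSums(sizes, 4)));
        // Sorted largest first: prints [340, 100]
        System.out.println(Arrays.toString(groupSums(sortedDescending(sizes), 4)));
    }
}
```

With these made-up sizes, the unsorted order happens to give two equal groups, while the sorted order puts 340 of the 440 units into the first group.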

One way to fix this is to have fair routing track not only the aggregated size of each partition,
but also the size of each partition for each source task. That, however, would significantly
increase the memory footprint.
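
A rough back-of-the-envelope sketch of that cost follows. It assumes one 8-byte counter per tracked size; the task and partition counts are hypothetical, and the class {{StatsFootprintEstimate}} is not Tez code. Real stats may be stored more compactly, so treat the figures as an order-of-magnitude illustration only.

```java
// Hypothetical estimate; not part of Tez.
final class StatsFootprintEstimate {

    // Assumption: each tracked size is held in one long.
    static final int BYTES_PER_COUNTER = Long.BYTES;

    // One counter per partition (aggregated across all source tasks).
    static long aggregatedBytes(int partitions) {
        return (long) partitions * BYTES_PER_COUNTER;
    }

    // One counter per (source task, partition) pair.
    static long perSourceTaskBytes(int sourceTasks, int partitions) {
        return (long) sourceTasks * partitions * BYTES_PER_COUNTER;
    }

    public static void main(String[] args) {
        int sourceTasks = 10_000; // made-up value
        int partitions = 1_000;   // made-up value
        System.out.println(aggregatedBytes(partitions));                 // prints 8000
        System.out.println(perSourceTaskBytes(sourceTasks, partitions)); // prints 80000000
    }
}
```

Under these assumptions the per-source-task stats grow by a factor equal to the number of source tasks.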

Alternatively, the sorting above can be made optional and skipped. Test results for TEZ-3209
show that jobs can finish 30% faster when the source tasks' output sizes are more balanced.
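
One possible shape for such an option, as a sketch only: the class {{OptionalSplitSort}}, the method {{orderSplits}}, and the {{sortSplits}} flag are hypothetical names, not the actual patch or an existing Tez setting. The idea is simply to guard the existing {{Arrays.sort}} call with a boolean.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper; not part of Tez.
final class OptionalSplitSort {

    // Sorts the splits in place with the given comparator only when
    // sortSplits is true; otherwise the original order is preserved.
    static <T> T[] orderSplits(T[] splits, Comparator<? super T> bySize, boolean sortSplits) {
        if (sortSplits) {
            Arrays.sort(splits, bySize);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Long values stand in for split sizes in this sketch.
        Long[] kept = {100L, 10L, 90L, 20L};
        Long[] sorted = kept.clone();
        System.out.println(Arrays.toString(
            orderSplits(kept, Comparator.reverseOrder(), false)));   // prints [100, 10, 90, 20]
        System.out.println(Arrays.toString(
            orderSplits(sorted, Comparator.reverseOrder(), true)));  // prints [100, 90, 20, 10]
    }
}
```

Defaulting the flag to true would keep the current behaviour for existing jobs, while jobs that use fair routing could turn sorting off.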



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
