tez-dev mailing list archives

From "Gopal V (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (TEZ-2104) A CrossProductEdge which produces synthetic cross-product parallelism
Date Fri, 02 Feb 2018 07:11:03 GMT

     [ https://issues.apache.org/jira/browse/TEZ-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gopal V resolved TEZ-2104.
       Resolution: Fixed
    Fix Version/s: 0.9.1

> A CrossProductEdge which produces synthetic cross-product parallelism
> ---------------------------------------------------------------------
>                 Key: TEZ-2104
>                 URL: https://issues.apache.org/jira/browse/TEZ-2104
>             Project: Apache Tez
>          Issue Type: New Feature
>            Reporter: Gopal V
>            Assignee: Zhiyuan Yang
>            Priority: Major
>              Labels: gsoc, gsoc2015, hadoop, hive, java, tez
>             Fix For: 0.9.1
>         Attachments: Cartesian product edge design.2.pdf, Cross product edge design.pdf
> Instead of duplicating data to fit the synthetic cross-product into partitions, a special-purpose cross-product data movement can vastly reduce the net IO.
> The Shuffle edge routes each partition's output to a single reducer, while the cross-product edge routes it into a matrix of reducers without actually duplicating the data on disk.
> A partitioning scheme with 3 partitions each on the lhs and rhs of a join operation can be routed into 9 reducers by performing a cross-product similar to
> (1,2,3) x (a,b,c) = [(1,a), (1,b), (1,c), (2,a), (2,b) ...]
> This turns a single-task cross-product model into a distributed cross-product.
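
The routing described above can be illustrated with a small sketch. This is not Tez code: the function names and the row-major numbering of reducers are assumptions made for illustration only, and the actual edge may number its tasks differently. The sketch shows that each (lhs, rhs) partition pair gets its own reducer, and that a single lhs partition is read by several reducers rather than being copied.

```python
from itertools import product


def reducer_index(lhs_idx, rhs_idx, num_rhs):
    # Hypothetical row-major layout: reducers form a num_lhs x num_rhs matrix.
    return lhs_idx * num_rhs + rhs_idx


def route(lhs_parts, rhs_parts):
    # Map every reducer to the one (lhs, rhs) partition pair it reads.
    num_rhs = len(rhs_parts)
    routing = {}
    for (i, lhs), (j, rhs) in product(enumerate(lhs_parts), enumerate(rhs_parts)):
        routing[reducer_index(i, j, num_rhs)] = (lhs, rhs)
    return routing


def readers_of_lhs(lhs_idx, num_rhs):
    # Reducers that fetch the same lhs partition; the partition itself
    # is stored once and read num_rhs times.
    return [reducer_index(lhs_idx, j, num_rhs) for j in range(num_rhs)]


routing = route([1, 2, 3], ["a", "b", "c"])
for idx in sorted(routing):
    print(idx, routing[idx])
```

With 3 partitions on each side, `routing` has 9 entries, one per reducer, and `readers_of_lhs(0, 3)` lists the three reducers that all read the first lhs partition.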

This message was sent by Atlassian JIRA
