flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1060) Support explicit shuffling of DataSets
Date Mon, 01 Sep 2014 12:44:21 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117387#comment-14117387 ]

ASF GitHub Bot commented on FLINK-1060:

GitHub user fhueske opened a pull request:


    [FLINK-1060] Added methods to DataSet to explicitly hash-partition or rebalance the input of Map-based operators.

    Adds three methods to DataSet:
    - `DataSet.partitionByHash(int...)`
    - `DataSet.partitionByHash(KeySelector)`
    - `DataSet.rebalance()`
    The methods create a PartitionedDataSet to which Map-based operators (`filter()`,
    `map()`, `flatMap()`, and `mapPartition()`) can be applied.
    Feedback welcome.
    I will add documentation once the code changes are considered ready to merge.
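
As a rough illustration of the two strategies named in the pull request, the following plain-Java sketch mimics hash partitioning and rebalancing on an in-memory list. It is not Flink code: the class `PartitionSketch`, its method names, and the use of `String.hashCode()` as the hash function are assumptions made for illustration only, and Flink's actual partitioner may behave differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: hypothetical helper, not part of Flink.
class PartitionSketch {

    // Creates n empty partitions.
    static <T> List<List<T>> emptyPartitions(int n) {
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            parts.add(new ArrayList<>());
        }
        return parts;
    }

    // Hash partitioning: equal records always land in the same partition.
    // Assumption: the record itself is the key and String.hashCode() is the hash.
    static List<List<String>> hashPartition(List<String> records, int n) {
        List<List<String>> parts = emptyPartitions(n);
        for (String r : records) {
            // floorMod keeps the index non-negative for negative hash codes.
            parts.get(Math.floorMod(r.hashCode(), n)).add(r);
        }
        return parts;
    }

    // Rebalancing: records are dealt out round-robin, so partition sizes
    // differ by at most one regardless of the record values.
    static <T> List<List<T>> rebalance(List<T> records, int n) {
        List<List<T>> parts = emptyPartitions(n);
        int next = 0;
        for (T r : records) {
            parts.get(next).add(r);
            next = (next + 1) % n;
        }
        return parts;
    }

    public static void main(String[] args) {
        // A skewed input: most records share the same value.
        List<String> words = Arrays.asList("a", "b", "a", "a", "a", "c", "a", "a");
        System.out.println("hash:      " + hashPartition(words, 3));
        System.out.println("rebalance: " + rebalance(words, 3));
    }
}
```

With a skewed input such as the one in `main`, hash partitioning keeps all equal records together and therefore produces uneven partitions, while rebalancing evens out the partition sizes but no longer groups equal records.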

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/fhueske/incubator-flink userPartition

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #108

commit 75d73f8ee801575691d414ea9104fb63cf3ef494
Author: Fabian Hueske <fhueske@apache.org>
Date:   2014-08-19T23:03:56Z

    [FLINK-1060] Added methods to DataSet to explicitly hash-partition or rebalance the input of Map-based operators.


> Support explicit shuffling of DataSets
> --------------------------------------
>                 Key: FLINK-1060
>                 URL: https://issues.apache.org/jira/browse/FLINK-1060
>             Project: Flink
>          Issue Type: Improvement
>          Components: Java API, Optimizer
>            Reporter: Fabian Hueske
>            Assignee: Fabian Hueske
>            Priority: Minor
> Right now, Flink only shuffles data if an operation such as Reduce, Join, or CoGroup
> requires it. There is no way to explicitly shuffle a data set.
> However, in some situations explicit shuffling would be very helpful, including:
> - rebalancing before compute-intensive Map operations
> - balancing, random, or hash partitioning before PartitionMap operations (see FLINK-1053)
> - better integration of support for HadoopJobs (see FLINK-838)
> With this issue, I propose to add the following methods to {{DataSet}}:
> - {{DataSet.partitionHashBy(int...)}} and {{DataSet.partitionHashBy(KeySelector)}} to
>   perform an explicit hash partitioning
> - {{DataSet.partitionRandomly()}} to shuffle data completely at random
> - {{DataSet.partitionRoundRobin()}} to shuffle data in a round-robin fashion, which generates
>   a very even distribution, with possible bias due to prior distributions
> {{DataSet.partitionRoundRobin()}} might not be necessary if we think that random
> shuffling balances well enough.
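
The trade-off raised at the end of the issue description (random versus round-robin shuffling) can be explored with a small plain-Java sketch that counts how many records each partition receives under the two schemes. This is only a model of a single sender under stated assumptions: the class `ShuffleBalanceSketch` and its methods are hypothetical, nothing here is Flink code, and the sketch does not capture the bias from prior distributions that the issue mentions.

```java
import java.util.Arrays;
import java.util.Random;

// Illustration only: hypothetical helper, not part of Flink.
class ShuffleBalanceSketch {

    // Sends each record to a uniformly random partition; returns partition sizes.
    static int[] randomSizes(int records, int partitions, long seed) {
        Random rnd = new Random(seed);
        int[] sizes = new int[partitions];
        for (int i = 0; i < records; i++) {
            sizes[rnd.nextInt(partitions)]++;
        }
        return sizes;
    }

    // Sends record i to partition i mod partitions; returns partition sizes.
    static int[] roundRobinSizes(int records, int partitions) {
        int[] sizes = new int[partitions];
        for (int i = 0; i < records; i++) {
            sizes[i % partitions]++;
        }
        return sizes;
    }

    // Difference between the largest and the smallest partition.
    static int spread(int[] sizes) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int s : sizes) {
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        return max - min;
    }

    public static void main(String[] args) {
        int[] random = randomSizes(1000, 7, 42L);
        int[] roundRobin = roundRobinSizes(1000, 7);
        System.out.println("random:      " + Arrays.toString(random) + " spread=" + spread(random));
        System.out.println("round-robin: " + Arrays.toString(roundRobin) + " spread=" + spread(roundRobin));
    }
}
```

For a single sender, the round-robin spread is never more than one record, whereas the random spread varies with the seed and the number of records. Whether that variation is small enough in practice is the open question the issue raises.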

This message was sent by Atlassian JIRA
