flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Fabian Hueske (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (FLINK-1043) Alternative combine interface
Date Fri, 08 Aug 2014 14:24:13 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090793#comment-14090793
] 

Fabian Hueske edited comment on FLINK-1043 at 8/8/14 2:23 PM:
--------------------------------------------------------------

Right now a Combiner is considered to be optional and the Reduce must be able to consume its
input data even if the Combiner was not executed. 
This can happen, if the data is already partitioned in a suitable way. In this case, a Combiner
would cause additional overhead and not help to reduce the data to be shuffled (as there is
nothing to be shuffled).

Introducing a new combiner interface would make the execution of the combiner mandatory which
needs to be considered during optimization. 
However, I think this would go well with our recent changes to extract combine into an interface
(FLINK-848).


was (Author: fhueske):
Right now a Combiner is considered to be optional and the Reduce must be able to consume to
consume its input data even if the Combiner was not executed. 
This can happen, if the data is already partitioned in a suitable way. In this case, a Combiner
would cause additional overhead and not help to reduce the data to be shuffled (as there is
nothing to be shuffled).

Introducing a new combiner interface would make the execution of the combiner mandatory which
needs to be considered during optimization. 
However, I think this would go well with our recent changes to extract combine into an interface
(FLINK-848).

> Alternative combine interface
> -----------------------------
>
>                 Key: FLINK-1043
>                 URL: https://issues.apache.org/jira/browse/FLINK-1043
>             Project: Flink
>          Issue Type: Wish
>            Reporter: Sebastian Kruse
>
> The GroupReduce allows for the following combination reduce step: {{InputType}} ->
combine -> {{InputType}} -> reduce -> {{OutputType}}. However, in the use cases I
have stumbled upon so far, it would make more sense to have the following steps: {{InputType}}
-> {{OutputType}} -> {{OutputType}}. It seems more intuitive to me to create a set of
partial results with the combiners that will finally merged within the reduce phase into an
overall result. This sometimes bars me from using a combiner.
> I provide some examples for this intuition.
> * WordCount
> ** If you want to implement WordCount with as a combinable GroupReduce, then you have
to preprocess all words as {{Tuple2<String, 1>}}. This could be avoided if the combination
result was not necessarily equal to the input type.
> * create a Bloom filter
> ** Bloom filters can be created locally on each node and later on be merged into a final,
global Bloom filter, thus lend themselves for a combine-reduce proceeding. Doing this with
a combinable GroupReduce would currently require to turn each input element into a singleton
Bloom filter before the combination phase.
> Therefore, it would be nice to have the ability to use {{OutputType}} as the combiner
result.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message