flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1963) Improve distinct() transformation
Date Mon, 13 Jul 2015 12:38:04 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624587#comment-14624587 ]

ASF GitHub Bot commented on FLINK-1963:
---------------------------------------

Github user chiwanpark commented on the pull request:

    https://github.com/apache/flink/pull/905#issuecomment-120913343
  
    Hi @pp86, thanks for your contribution.
    
    However, I don't think that using `AutoSelector` is the best approach to improving the distinct transformation. In Flink, a `KeySelector` converts a `DataSet<O>` to a `DataSet<Tuple2<K, O>>` and uses the first element of the tuple as the key. For atomic types, `AutoSelector` creates a `DataSet<Tuple2<V, V>>`, which unnecessarily duplicates the data.
    
    I recommend using `Keys.ExpressionKeys` when the user calls the `distinct()` method on atomic data types.
    
    It would also be better to add test cases for these changes.
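
    The duplication concern raised above can be pictured with a minimal plain-Java sketch. It does not use Flink at all: `DistinctSketch`, `distinctViaWrapping`, and `distinctDirect` are hypothetical names, and `SimpleImmutableEntry` only stands in for a key/value tuple. It is meant as an illustration of the argument, not as the way Flink implements `distinct()`.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class DistinctSketch {

    // Models the key-selector route: every element is first wrapped as
    // (key, element), then the first element seen for each key is kept.
    // With an identity selector on an atomic type, each wrapper holds the
    // same value twice, which is the duplication the comment refers to.
    public static <T, K> List<T> distinctViaWrapping(List<T> input, Function<T, K> keySelector) {
        List<Map.Entry<K, T>> wrapped = new ArrayList<>();
        for (T element : input) {
            wrapped.add(new SimpleImmutableEntry<>(keySelector.apply(element), element));
        }
        Map<K, T> firstPerKey = new LinkedHashMap<>();
        for (Map.Entry<K, T> pair : wrapped) {
            firstPerKey.putIfAbsent(pair.getKey(), pair.getValue());
        }
        return new ArrayList<>(firstPerKey.values());
    }

    // Models the direct route for atomic types: the element is its own key,
    // so no wrapper object is created.
    public static <T> List<T> distinctDirect(List<T> input) {
        return new ArrayList<>(new LinkedHashSet<>(input));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "b");
        System.out.println(distinctViaWrapping(words, Function.identity())); // prints [a, b, c]
        System.out.println(distinctDirect(words));                           // prints [a, b, c]
    }
}
```

    Both routes give the same result; the difference is that the wrapping route allocates a pair per element whose two fields are identical for atomic types.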


> Improve distinct() transformation
> ---------------------------------
>
>                 Key: FLINK-1963
>                 URL: https://issues.apache.org/jira/browse/FLINK-1963
>             Project: Flink
>          Issue Type: Improvement
>          Components: Java API, Scala API
>    Affects Versions: 0.9
>            Reporter: Fabian Hueske
>            Assignee: pietro pinoli
>            Priority: Minor
>              Labels: starter
>             Fix For: 0.9
>
>
> The `distinct()` transformation is a bit limited right now with respect to processing atomic key types:
> - `distinct(String ...)` works only for composite data types (POJO, tuple), but wildcard expressions should also be supported for atomic key types
> - `distinct()` only works for composite types, but should also work for atomic key types
> - `distinct(KeySelector)` is the most generic one, but not very handy to use
> - `distinct(int ...)` works only for Tuple data types (which is fine)
> Fixing this should be rather easy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
