flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-2692) Untangle CsvInputFormat into PojoTypeCsvInputFormat and TupleTypeCsvInputFormat
Date Sun, 18 Oct 2015 22:35:05 GMT

    [ https://issues.apache.org/jira/browse/FLINK-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962663#comment-14962663 ]

ASF GitHub Bot commented on FLINK-2692:

GitHub user zentol opened a pull request:


    [FLINK-2692] Untangle CsvInputFormat

    This PR splits the CsvInputFormat into a Tuple and a POJO version. To this end, the (Common)CsvInputFormat
classes were merged, and the type-specific portions were refactored into separate classes.
    Additionally, the ScalaCsvInputFormat has been removed; Java and Scala API now use the
same InputFormats. Previously, the formats differed in the way they created the output tuples;
this is now realized in a newly introduced abstract method "createOrReuseInstance(Object[]
fieldValues, T reuse)" within the TupleSerializerBase.
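
To illustrate the idea behind createOrReuseInstance, here is a minimal, self-contained sketch. It does not use the real Flink classes: TupleSerializerSketch, MutablePair, ImmutablePair and both serializer subclasses are hypothetical stand-ins, and the assumption that one implementation fills the reuse object in place while the other builds a fresh instance is an illustration of the difference described above, not a copy of the PR's code.

```java
public class CreateOrReuseSketch {

    // Hypothetical stand-in for the serializer base class; only the
    // method signature is taken from the PR description.
    abstract static class TupleSerializerSketch<T> {
        abstract T createOrReuseInstance(Object[] fieldValues, T reuse);
    }

    // Mutable tuple-like type: fields can be overwritten in place.
    static final class MutablePair {
        Object f0;
        Object f1;
    }

    // Immutable tuple-like type: fields are fixed at construction time.
    static final class ImmutablePair {
        final Object first;
        final Object second;

        ImmutablePair(Object first, Object second) {
            this.first = first;
            this.second = second;
        }
    }

    // Assumed behavior for mutable tuples: overwrite the reuse object.
    static final class ReusingSerializer extends TupleSerializerSketch<MutablePair> {
        @Override
        MutablePair createOrReuseInstance(Object[] fieldValues, MutablePair reuse) {
            reuse.f0 = fieldValues[0];
            reuse.f1 = fieldValues[1];
            return reuse;
        }
    }

    // Assumed behavior for immutable tuples: ignore reuse, build a new object.
    static final class CreatingSerializer extends TupleSerializerSketch<ImmutablePair> {
        @Override
        ImmutablePair createOrReuseInstance(Object[] fieldValues, ImmutablePair reuse) {
            return new ImmutablePair(fieldValues[0], fieldValues[1]);
        }
    }

    public static void main(String[] args) {
        Object[] parsed = {"a", 1};

        MutablePair reuse = new MutablePair();
        MutablePair filled = new ReusingSerializer().createOrReuseInstance(parsed, reuse);
        System.out.println("same object reused: " + (filled == reuse));

        ImmutablePair old = new ImmutablePair("x", 0);
        ImmutablePair fresh = new CreatingSerializer().createOrReuseInstance(parsed, old);
        System.out.println("new object created: " + (fresh != old));
    }
}
```

With a hook of this shape, the input format only has to hand over the parsed field values; how the output object comes into existence is left to the serializer.
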
    Fields to include and field names are no longer passed via setters, but instead via the
constructor. Several new constructors were added to accommodate different use cases, along with
2 new static methods to generate a default include mask, or convert an int[] list of indices to
a boolean include mask.
    Classes no longer have to be passed separately, as they are extracted from the TypeInformation.
    A few sanity checks were moved from the ExecEnvironment to the InputFormat.
    The testReadSparseWithShuffledPositions test was removed, since a monotonic order of field
indices is, and afaik was, not actually necessary due to the way the indices are converted to a boolean[].
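
The point about index order can be seen in a small sketch of the two mask helpers mentioned above. The method names createDefaultMask and toBooleanMask, their signatures, and the class IncludeMaskSketch are assumptions made for illustration; the PR description only says that two such static methods exist.

```java
import java.util.Arrays;

public class IncludeMaskSketch {

    // Hypothetical helper: include each of the first `size` fields.
    static boolean[] createDefaultMask(int size) {
        boolean[] mask = new boolean[size];
        Arrays.fill(mask, true);
        return mask;
    }

    // Hypothetical helper: turn a list of field positions into an include
    // mask. The mask is as long as the largest index requires.
    static boolean[] toBooleanMask(int[] fieldIndices) {
        int max = -1;
        for (int index : fieldIndices) {
            if (index < 0) {
                throw new IllegalArgumentException("Field indices must not be negative: " + index);
            }
            max = Math.max(max, index);
        }
        boolean[] mask = new boolean[max + 1];
        for (int index : fieldIndices) {
            mask[index] = true;
        }
        return mask;
    }

    public static void main(String[] args) {
        // Sorted and shuffled index lists describe the same set of fields.
        System.out.println(Arrays.toString(toBooleanMask(new int[] {0, 1, 3})));
        System.out.println(Arrays.toString(toBooleanMask(new int[] {3, 0, 1})));
        System.out.println(Arrays.toString(createDefaultMask(3)));
    }
}
```

Both calls to toBooleanMask print [true, true, false, true]: a boolean mask records only which positions are included, so the order in which the indices were listed is lost in the conversion.
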

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zentol/flink 2692_csv

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1266
commit d497415adc2e58b4e9912ae89a53444825416366
Author: zentol <s.motsu@web.de>
Date:   2015-10-18T18:23:23Z

    [FLINK-2692] Untangle CsvInputFormat


> Untangle CsvInputFormat into PojoTypeCsvInputFormat and TupleTypeCsvInputFormat 
> --------------------------------------------------------------------------------
>                 Key: FLINK-2692
>                 URL: https://issues.apache.org/jira/browse/FLINK-2692
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Till Rohrmann
>            Assignee: Chesnay Schepler
>            Priority: Minor
> The {{CsvInputFormat}} currently allows values to be returned as a {{Tuple}} or a {{Pojo}}
type. As a consequence, the processing logic, which has to work for both types, is overly
complex. For example, the {{CsvInputFormat}} contains fields which are only used when a Pojo
is returned. Moreover, the POJO field information is constructed by calling setter methods
which have to be called in a very specific order, otherwise they fail. E.g. one first has
to call {{setFieldTypes}} before calling {{setOrderOfPOJOFields}}, otherwise the number of
fields might be different. Furthermore, some of the methods can only be called if the return
type is a {{Pojo}} type, because they expect that a {{PojoTypeInfo}} is present.
> I think the {{CsvInputFormat}} should be refactored to make the code more easily maintainable.
I propose to split it up into a {{PojoTypeCsvInputFormat}} and a {{TupleTypeCsvInputFormat}}
which take all the required information via their constructors instead of using the {{setFields}}
and {{setOrderOfPOJOFields}} approach.
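
A minimal sketch of the constructor-based approach proposed above, in plain Java. The class PojoCsvConfigSketch, its parameters and its checks are hypothetical and not taken from Flink; the sketch only shows why passing everything at construction time removes the call-order problem: the object is validated once and cannot be observed in a half-configured state.

```java
public class PojoCsvConfigSketch {

    private final String[] fieldNames;
    private final boolean[] includedMask;

    // All required information arrives together, so consistency can be
    // checked in one place instead of depending on setter call order.
    public PojoCsvConfigSketch(String[] fieldNames, boolean[] includedMask) {
        if (fieldNames == null || includedMask == null) {
            throw new IllegalArgumentException("Field names and include mask must not be null.");
        }
        int included = 0;
        for (boolean include : includedMask) {
            if (include) {
                included++;
            }
        }
        if (included != fieldNames.length) {
            throw new IllegalArgumentException(
                "Got " + fieldNames.length + " field names for " + included + " included columns.");
        }
        // Defensive copies keep the configuration immutable after construction.
        this.fieldNames = fieldNames.clone();
        this.includedMask = includedMask.clone();
    }

    public int fieldCount() {
        return fieldNames.length;
    }

    public static void main(String[] args) {
        PojoCsvConfigSketch config = new PojoCsvConfigSketch(
            new String[] {"id", "name"}, new boolean[] {true, false, true});
        System.out.println("fields to read: " + config.fieldCount());

        try {
            new PojoCsvConfigSketch(new String[] {"id"}, new boolean[] {true, true});
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```
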

This message was sent by Atlassian JIRA
