flink-issues mailing list archives

From "Luke Hutchison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-6016) Newlines should be valid in quoted strings in CSV
Date Fri, 01 Sep 2017 17:20:01 GMT

    [ https://issues.apache.org/jira/browse/FLINK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150860#comment-16150860 ]

Luke Hutchison commented on FLINK-6016:

Yes, that's what I'm suggesting. The data doesn't have to be read twice: records can be
emitted during the first pass. The efficiency of doing so, however, depends on the bandwidth
between the single reading thread and the worker threads for each shard.
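
As a rough illustration of the single-pass idea above, here is a minimal sketch of a reader
that splits records on unquoted newlines and hands each one off as soon as it is complete.
The class name, method name and the Consumer hand-off are hypothetical (this is not Flink's
API), and the sketch assumes RFC 4180-style quoting with "\n" line endings only.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

// Hypothetical helper, not part of Flink.
final class SinglePassRecordReader {

    // Reads the input once and emits each complete record to `emit`.
    // In a real job, `emit` would forward the record to a worker thread.
    static void readRecords(Reader in, Consumer<String> emit) {
        StringBuilder record = new StringBuilder();
        boolean inQuotes = false;
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (c == '"') {
                    // An escaped quote ("") toggles twice, so the state stays correct.
                    inQuotes = !inQuotes;
                }
                if (c == '\n' && !inQuotes) {
                    // Unquoted newline: the record is complete.
                    emit.accept(record.toString());
                    record.setLength(0);
                } else {
                    record.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (record.length() > 0) {
            // Last record without a trailing newline.
            emit.accept(record.toString());
        }
    }
}
```

Every record passes through the one reading thread before any worker sees it, which is the
bandwidth limit mentioned above.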

A more scalable, though more complex, approach would be to run a parser state machine over
each shard independently, recording the parser state at each input character. Each shard's
parser then "runs off the end" of its shard boundary into the next shard, overwriting the
parser states recorded there, until its own state matches the state that the next shard's
parser recorded at the same character position. Once the states agree everywhere, the
characters whose recorded state is "unquoted newline" give the line breaks.
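
A minimal sketch of this idea, assuming a two-state machine (inside or outside quotes) and
shards that all start out assuming "outside quotes". All names are hypothetical, and the
shards are processed sequentially here for clarity; the first phase is the part that could
run on one worker per shard.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flink.
final class ShardedCsvLineSplitter {

    // Parser state after consuming c: true means "inside a quoted field".
    static boolean step(boolean inQuotes, char c) {
        return c == '"' ? !inQuotes : inQuotes;
    }

    // Returns the offsets of all newlines that are outside quoted fields.
    static List<Integer> findLineBreaks(String data, int shardSize) {
        int n = data.length();
        boolean[] stateAfter = new boolean[n];

        // Phase 1: parse each shard on its own, guessing that it starts unquoted.
        for (int start = 0; start < n; start += shardSize) {
            boolean s = false;
            int end = Math.min(n, start + shardSize);
            for (int i = start; i < end; i++) {
                s = step(s, data.charAt(i));
                stateAfter[i] = s;
            }
        }

        // Phase 2: carry the state at the end of each shard over the boundary,
        // overwriting the guessed states until the two parsers agree.
        for (int start = shardSize; start < n; start += shardSize) {
            boolean s = stateAfter[start - 1];
            for (int i = start; i < n; i++) {
                s = step(s, data.charAt(i));
                if (stateAfter[i] == s) {
                    break; // states match, the rest of the guess is already right
                }
                stateAfter[i] = s;
            }
        }

        // Phase 3: a newline recorded in the unquoted state is a line break.
        List<Integer> breaks = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (data.charAt(i) == '\n' && !stateAfter[i]) {
                breaks.add(i);
            }
        }
        return breaks;
    }
}
```

For the input "3\n4",5\na,b\n with a shard size of 3, this yields the offsets 7 and 11; the
newline at offset 2 is skipped because it is inside quotes. Note that with only two states,
a wrong guess is never corrected partway through a shard, so the overrun continues at least
to the next shard boundary. A richer state machine may converge sooner.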

> Newlines should be valid in quoted strings in CSV
> -------------------------------------------------
>                 Key: FLINK-6016
>                 URL: https://issues.apache.org/jira/browse/FLINK-6016
>             Project: Flink
>          Issue Type: Bug
>          Components: Batch Connectors and Input/Output Formats
>    Affects Versions: 1.2.0
>            Reporter: Luke Hutchison
> The RFC for the CSV format specifies that newlines are valid in quoted strings in CSV:
> https://tools.ietf.org/html/rfc4180
> However, when parsing a CSV file with Flink containing a newline, such as:
> {noformat}
> "3
> 4",5
> {noformat}
> you get this exception:
> {noformat}
> Line could not be parsed: '"3'
> Expect field types: class java.lang.String, class java.lang.String 
> {noformat}
