spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-26786) Handle to treat escaped newline characters('\r','\n') in spark csv
Date Fri, 01 Feb 2019 02:35:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-26786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757898#comment-16757898
] 

Hyukjin Kwon commented on SPARK-26786:
--------------------------------------

This behaviour is inherited from Univocity parser if I am not mistaken. Can you iterate with
this issue there?

> Handle to treat escaped newline characters('\r','\n') in spark csv
> ------------------------------------------------------------------
>
>                 Key: SPARK-26786
>                 URL: https://issues.apache.org/jira/browse/SPARK-26786
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: vishnuram selvaraj
>            Priority: Major
>
> There are some systems like AWS redshift which writes csv files by escaping newline
characters('\r','\n') in addition to escaping the quote characters, if they come as part of
the data.
> Redshift documentation link([https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html)] and
below is their mention of escaping requirements in the mentioned link
> ESCAPE
> For CHAR and VARCHAR columns in delimited unload files, an escape character (\{{}}) is
placed before every occurrence of the following characters:
>  * Linefeed: {{\n}}
>  * Carriage return: {{\r}}
>  * The delimiter character specified for the unloaded data.
>  * The escape character: \{{}}
>  * A quote character: {{"}} or {{'}} (if both ESCAPE and ADDQUOTES are specified
in the UNLOAD command).
>  
> *Problem statement:* 
> But the spark CSV reader doesn't have a handle to treat/remove the escape characters
infront of the newline characters in the data.
> It would really help if we can add a feature to handle the escaped newline characters
through another parameter like (escapeNewline = 'true/false').
> *Example:*
> Below are the details of my test data set up in a file.
>  * The first record in that file has escaped windows newline character (
>  r
>  n)
>  * The third record in that file has escaped unix newline character (
>  n)
>  * The fifth record in that file has the escaped quote character (")
> the file looks like below in vi editor:
>  
> {code:java}
> "1","this is \^M\
> line1"^M
> "2","this is line2"^M
> "3","this is \
> line3"^M
> "4","this is \" line4"^M
> "5","this is line5"^M{code}
>  
> When I read the file in python's csv module with escape, it is able to remove the added
escape characters as you can see below,
>  
> {code:java}
> >>> with open('/tmp/test3.csv','r') as readCsv:
> ... readFile = csv.reader(readCsv,dialect='excel',escapechar='\\',quotechar='"',delimiter=',',doublequote=False)
> ... for row in readFile:
> ... print(row)
> ...
> ['1', 'this is \r\n line1']
> ['2', 'this is line2']
> ['3', 'this is \n line3']
> ['4', 'this is " line4']
> ['5', 'this is line5']
> {code}
> But if I read the same file in spark-csv reader, the escape characters infront of the
newline characters are not removed.But the escape before the (") is removed.
> {code:java}
> >>> redDf=spark.read.csv(path='file:///tmp/test3.csv',header='false',sep=',',quote='"',escape='\\',multiLine='true',ignoreLeadingWhiteSpace='true',ignoreTrailingWhiteSpace='true',mode='FAILFAST',inferSchema='false')
> >>> redDf.show()
> +---+------------------+
> |_c0| _c1|
> +---+------------------+
> \ 1|this is \
> line1|
> | 2| this is line2|
> | 3| this is \
> line3|
> | 4| this is " line4|
> | 5| this is line5|
> +---+------------------+
> {code}
>  *Expected result:*
> {code:java}
> +---+------------------+
> |_c0| _c1|
> +---+------------------+
> | 1|this is 
> line1|
> | 2| this is line2|
> | 3| this is 
> line3|
> | 4| this is " line4|
> | 5| this is line5|
> +---+------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message