spark-issues mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-14194) spark csv reader not working properly if CSV content contains CRLF character (newline) in the intermediate cell
Date Fri, 17 Feb 2017 08:35:42 GMT

    [ https://issues.apache.org/jira/browse/SPARK-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15871427#comment-15871427 ]

Sean Owen commented on SPARK-14194:
-----------------------------------

This is hard to fix because the text data source splits the input into lines before the CSV
parser ever sees it. I can imagine trying to stitch the lines back together with a transform
over a window of lines, but it's going to be hard to do given how the plumbing works.
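
To make the problem concrete, here is a minimal sketch in plain Java (no Spark involved) of
what happens when text is split into lines before any CSV parsing, and of one naive way to
stitch the pieces back together by tracking whether a quote is still open. The class
QuotedLineStitcher and its methods are hypothetical names used only for illustration; they
are not part of Spark or spark-csv. The sketch assumes quotes inside a field are escaped by
doubling them, and it does not handle backslash-escaped quotes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Spark or spark-csv.
class QuotedLineStitcher {

    // Splits text on LF or CRLF, the way a line-oriented source would,
    // with no knowledge of CSV quoting.
    static List<String> splitLines(String text) {
        List<String> lines = new ArrayList<>();
        for (String line : text.split("\r?\n")) {
            lines.add(line);
        }
        return lines;
    }

    // Rejoins physical lines into logical records: a record ends only on a
    // line that leaves no quote open. Doubled quotes ("") toggle the state
    // twice, so they leave it unchanged.
    static List<String> stitch(List<String> lines, char quote) {
        List<String> records = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        boolean insideQuotes = false;
        for (String line : lines) {
            if (insideQuotes) {
                // Restore the line break removed by the split (as LF; the
                // original terminator is not recoverable at this point).
                pending.append('\n');
            }
            pending.append(line);
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == quote) {
                    insideQuotes = !insideQuotes;
                }
            }
            if (!insideQuotes) {
                records.add(pending.toString());
                pending.setLength(0);
            }
        }
        if (pending.length() > 0) {
            // Unterminated quote at end of input: emit what was collected.
            records.add(pending.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        // Assumed sample: a header, one row with a line break inside a
        // quoted field, and one ordinary row.
        String csv = "id,address\n1,\"XZ Street\nMunicipality\"\n2,Elsewhere";
        List<String> lines = splitLines(csv);
        List<String> records = stitch(lines, '"');
        System.out.println(lines.size());   // prints 4
        System.out.println(records.size()); // prints 3
    }
}
```

This only works because the whole input sits in one in-memory list. In a distributed text
source a quoted field can straddle the boundary between two input splits, so no single task
sees both halves, which is presumably part of why the plumbing makes this hard.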

> spark csv reader not working properly if CSV content contains CRLF character (newline) in the intermediate cell
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-14194
>                 URL: https://issues.apache.org/jira/browse/SPARK-14194
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2, 2.1.0
>            Reporter: Kumaresh C R
>
> We have CSV content like the following:
> Sl.NO, Employee_Name, Company, Address, Country, ZIP_Code\n\r
> "1", "ABCD", "XYZ", "1234", "XZ Street \n\r(CRLF character), Municipality,....","USA", "1234567"
> Since there is a '\n\r' sequence in the middle of the row (in the Address column, to be exact),
> the Spark code below creates a DataFrame with two rows (excluding the header row), which is
> wrong. Since we have specified the quote character ("), why is the line break inside the
> quoted field treated as a row separator? This causes problems when processing the resulting
> DataFrame.
>  DataFrame df = sqlContextManager.getSqlContext().read().format("com.databricks.spark.csv")
>                     .option("header", "true")
>                     .option("inferSchema", "true")
>                     .option("delimiter", delim)
>                     .option("quote", quote)
>                     .option("escape", escape)
>                     .load(sourceFile);
>    


