spark-user mailing list archives

From Riccardo Ferrari <ferra...@gmail.com>
Subject Re: Problem with CSV line break data in PySpark 2.1.0
Date Sun, 03 Sep 2017 21:08:27 GMT
Hi Aakash,

What I see in the picture seems correct. Spark (pyspark) is reading your F2
cell as a multi-line text. Where are the nulls you're referring to?
You might find pyspark.sql.functions.regexp_replace
<http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.regexp_replace>
useful for removing newlines and other unwanted whitespace (\s already
matches \n, so a single \s+ is enough):
df.select(..., regexp_replace(<column-name>, r'\s+', ' '), ...)
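A slightly fuller sketch of the same idea, in case it helps. It assumes your
DataFrame has a string column holding the multi-line text; flatten_text and
flatten_column are hypothetical helper names, not part of PySpark. The plain
Python function is only there to check the pattern locally, and the second
function applies the same pattern through regexp_replace:

```python
import re

# One or more whitespace characters; \s also covers \n, \r and \t.
WHITESPACE_RUN = r"\s+"


def flatten_text(value):
    """Plain-Python check of the pattern: collapse whitespace runs to one space."""
    return re.sub(WHITESPACE_RUN, " ", value)


def flatten_column(df, column_name):
    """Apply the same pattern to a Spark DataFrame column.

    Assumes `df` is a pyspark.sql.DataFrame and `column_name` is a string
    column (hypothetical names, adjust to your data). pyspark is imported
    here so the pattern above can be tried without a Spark installation.
    """
    from pyspark.sql.functions import col, regexp_replace

    return df.withColumn(
        column_name,
        regexp_replace(col(column_name), WHITESPACE_RUN, " "),
    )
```

Usage would be something like df = flatten_column(df, "F"), where "F" stands
in for whatever your multi-line column is actually called.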

Best,

On Sun, Sep 3, 2017 at 12:15 PM, Aakash Basu <aakash.spark.raj@gmail.com>
wrote:

> Hi,
>
> I've a dataset where a few rows of the column F as shown below have line
> breaks in CSV file.
>
> [image: Inline image 1]
>
> When Spark is reading it, it is coming as below, which is a complete new
> line.
>
> [image: Inline image 2]
>
> I want my PySpark 2.1.0 to read it by forcefully avoiding the line break
> after the date, which is not happening as I am using com.databricks.csv
> reader. And nulls are getting created after the date for line 2 for the
> rest of the columns from G till end.
>
> Can I please be helped how to handle this?
>
> Thanks,
> Aakash.
>
