spark-user mailing list archives

From SLiZn Liu <sliznmail...@gmail.com>
Subject Imported CSV file content isn't identical to the original file
Date Sun, 07 Feb 2016 07:44:00 GMT
Hi Spark Users Group,

I have a CSV file to analyze with Spark, but I’m having trouble importing it
as a DataFrame.

Here’s a minimal reproducible example. Suppose I have a
*10(rows)x2(cols)* *space-delimited csv* file, shown below:

1446566430 2015-11-04<SP>00:00:30
1446566430 2015-11-04<SP>00:00:30
1446566430 2015-11-04<SP>00:00:30
1446566430 2015-11-04<SP>00:00:30
1446566430 2015-11-04<SP>00:00:30
1446566431 2015-11-04<SP>00:00:31
1446566431 2015-11-04<SP>00:00:31
1446566431 2015-11-04<SP>00:00:31
1446566431 2015-11-04<SP>00:00:31
1446566431 2015-11-04<SP>00:00:31

The <SP> in column 2 represents a sub-delimiter within that column. The
file is stored on HDFS; let’s say the path is hdfs:///tmp/1.csv

I’m using *spark-csv* to import this file as Spark *DataFrame*:

sqlContext.read.format("com.databricks.spark.csv")
        .option("header", "false") // The file has no header row
        .option("inferSchema", "false") // Keep every column as a string
        .option("delimiter", " ") // Columns are separated by a single space
        .load("hdfs:///tmp/1.csv")
        .show

Oddly, the output shows only a part of each column:

[image: Screenshot from 2016-02-07 15-27-51.png]

and even the table borders weren’t drawn correctly. I also tried another
way of reading the CSV file, using sc.textFile(...).map(_.split(" ")) and
sqlContext.createDataFrame, and the result is the same. Can someone point
out what I did wrong?
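
For reference, the second approach looks roughly like the sketch below. It
assumes it is run in spark-shell, which provides sc and sqlContext; the
column names col1 and col2 are placeholders I made up, since the file has no
header.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Placeholder column names (assumption: the file has no header row)
val schema = StructType(Seq(
  StructField("col1", StringType, nullable = true),
  StructField("col2", StringType, nullable = true)))

// Read the raw lines and split each one on the space between the two columns
val lines = sc.textFile("hdfs:///tmp/1.csv")
val rows = lines.map(_.split(" ")).map(fields => Row(fields(0), fields(1)))

// Build the DataFrame from the RDD[Row] and the explicit schema
sqlContext.createDataFrame(rows, schema).show()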

—
BR,
Todd Leo
