spark-user mailing list archives

From Luciano Resende <luckbr1...@gmail.com>
Subject Re: Imported CSV file content isn't identical to the original file
Date Mon, 08 Feb 2016 06:12:51 GMT
I tried this on 1.5.0, 1.6.0, and 2.0.0 trunk with
com.databricks:spark-csv_2.10:1.3.0 and got the expected results: the
columns seem to be read properly.

+-----------+-----------------------+
|C0         |C1                     |
+-----------+-----------------------+
|1446566430 | 2015-11-04<SP>00:00:30|
|1446566430 | 2015-11-04<SP>00:00:30|
|1446566430 | 2015-11-04<SP>00:00:30|
|1446566430 | 2015-11-04<SP>00:00:30|
|1446566430 | 2015-11-04<SP>00:00:30|
|1446566431 | 2015-11-04<SP>00:00:31|
|1446566431 | 2015-11-04<SP>00:00:31|
|1446566431 | 2015-11-04<SP>00:00:31|
|1446566431 | 2015-11-04<SP>00:00:31|
|1446566431 | 2015-11-04<SP>00:00:31|
+-----------+-----------------------+
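
For reference, here is a minimal sketch of the second approach mentioned in
the quoted message below (sc.textFile plus sqlContext.createDataFrame). It
assumes a Spark 1.x spark-shell, where sc and sqlContext are predefined, and
that <SP> is literal text in the file. The parseLine helper and the explicit
C0/C1 schema are illustrative names, not taken from the original code.
Passing false to show turns off the default truncation of strings longer
than 20 characters; the second column here is 22 characters long.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical helper: split one line on a single space and keep it only
// if it yields exactly two fields.
def parseLine(line: String): Option[(String, String)] =
  line.split(" ", -1) match {
    case Array(c0, c1) => Some((c0, c1))
    case _             => None
  }

// Both columns are kept as strings, mirroring inferSchema = false.
val schema = StructType(Seq(
  StructField("C0", StringType, nullable = true),
  StructField("C1", StringType, nullable = true)))

// sc and sqlContext are assumed to be the ones provided by spark-shell.
val rows = sc.textFile("hdfs:///tmp/1.csv")
  .flatMap(line => parseLine(line).toSeq)
  .map { case (c0, c1) => Row(c0, c1) }

// show(false) prints full cell contents instead of truncating them.
sqlContext.createDataFrame(rows, schema).show(false)
```

Lines that do not split into exactly two fields are dropped silently in this
sketch; counting them instead would help spot malformed input.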




On Sat, Feb 6, 2016 at 11:44 PM, SLiZn Liu <sliznmailbox@gmail.com> wrote:

> Hi Spark Users Group,
>
> I have a CSV file to analyze with Spark, but I’m having trouble importing
> it as a DataFrame.
>
> Here’s a minimal reproducible example. Suppose I have a
> *10 (rows) x 2 (cols)* *space-delimited CSV* file, shown below:
>
> 1446566430 2015-11-04<SP>00:00:30
> 1446566430 2015-11-04<SP>00:00:30
> 1446566430 2015-11-04<SP>00:00:30
> 1446566430 2015-11-04<SP>00:00:30
> 1446566430 2015-11-04<SP>00:00:30
> 1446566431 2015-11-04<SP>00:00:31
> 1446566431 2015-11-04<SP>00:00:31
> 1446566431 2015-11-04<SP>00:00:31
> 1446566431 2015-11-04<SP>00:00:31
> 1446566431 2015-11-04<SP>00:00:31
>
> The <SP> in column 2 represents a sub-delimiter within that column. The
> file is stored on HDFS; let’s say the path is hdfs:///tmp/1.csv
>
> I’m using *spark-csv* to import this file as a Spark *DataFrame*:
>
> sqlContext.read.format("com.databricks.spark.csv")
>         .option("header", "false") // The file has no header row
>         .option("inferSchema", "false") // Keep every column as a string; no type inference
>         .option("delimiter", " ") // Fields are separated by a single space
>         .load("hdfs:///tmp/1.csv")
>         .show
>
> Oddly, the output shows only a part of each column:
>
> [image: Screenshot from 2016-02-07 15-27-51.png]
>
> and even the border of the table wasn’t drawn correctly. I also tried
> another way to read the CSV file, using sc.textFile(...).map(_.split(" "))
> and sqlContext.createDataFrame, and the result was the same. Can someone
> point out what I did wrong?
>
> —
> BR,
> Todd Leo



-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
