spark-user mailing list archives

From SLiZn Liu <sliznmail...@gmail.com>
Subject Re: Imported CSV file content isn't identical to the original file
Date Sun, 07 Feb 2016 10:09:19 GMT
Hi Igor,

In my case, it’s not a matter of *truncation*. As the Spark API doc for
show() reads,

truncate: Whether truncate long strings. If true, strings more than 20
characters will be truncated and all cells will be aligned right…

whereas in my output it is the leading characters of both columns that are missing.

Still, it’s good to know how to show the whole content of a cell.
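
For reference, here is a minimal sketch of that call applied to the reproduction from my original message. It assumes Spark 1.5.x with spark-csv on the classpath; the object name ShowFullCells and the app name are made up for illustration, and the master is expected to be supplied by spark-submit.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical driver object, named only for this illustration.
object ShowFullCells {
  def main(args: Array[String]): Unit = {
    // Master URL is assumed to be passed in by spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("show-full-cells"))
    val sqlContext = new SQLContext(sc)

    // Same read as in the original message: no header, no type inference,
    // single space as the column delimiter.
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .option("inferSchema", "false")
      .option("delimiter", " ")
      .load("hdfs:///tmp/1.csv")

    // Print up to 20 rows with truncate = false, so long cells are not cut.
    df.show(20, false)

    sc.stop()
  }
}
```

This only rules truncation in or out as the cause; it does not by itself explain the missing leading characters.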

—
BR,
Todd Leo

On Sun, Feb 7, 2016 at 5:42 PM Igor Berman <igor.berman@gmail.com> wrote:

> show() has a truncate argument;
> pass false so it won't truncate your results
>
> On 7 February 2016 at 11:01, SLiZn Liu <sliznmailbox@gmail.com> wrote:
>
>> Plus, I’m using *Spark 1.5.2*, with *spark-csv 1.3.0*. I also tried
>> HiveContext, but the result is exactly the same.
>>
>> On Sun, Feb 7, 2016 at 3:44 PM SLiZn Liu <sliznmailbox@gmail.com> wrote:
>>
>>> Hi Spark Users Group,
>>>
>>> I have a CSV file to analyze with Spark, but I’m having trouble
>>> importing it as a DataFrame.
>>>
>>> Here’s a minimal reproducible example. Suppose I have a
>>> *10(rows)x2(cols)* *space-delimited csv* file, as shown below:
>>>
>>> 1446566430 2015-11-04<SP>00:00:30
>>> 1446566430 2015-11-04<SP>00:00:30
>>> 1446566430 2015-11-04<SP>00:00:30
>>> 1446566430 2015-11-04<SP>00:00:30
>>> 1446566430 2015-11-04<SP>00:00:30
>>> 1446566431 2015-11-04<SP>00:00:31
>>> 1446566431 2015-11-04<SP>00:00:31
>>> 1446566431 2015-11-04<SP>00:00:31
>>> 1446566431 2015-11-04<SP>00:00:31
>>> 1446566431 2015-11-04<SP>00:00:31
>>>
>>> The <SP> in column 2 represents a sub-delimiter within that column.
>>> The file is stored on HDFS; let’s say the path is hdfs:///tmp/1.csv
>>>
>>> I’m using *spark-csv* to import this file as Spark *DataFrame*:
>>>
>>> sqlContext.read.format("com.databricks.spark.csv")
>>>         .option("header", "false") // The file has no header line
>>>         .option("inferSchema", "false") // Do not infer types; read every column as a string
>>>         .option("delimiter", " ")
>>>         .load("hdfs:///tmp/1.csv")
>>>         .show
>>>
>>> Oddly, the output shows only a part of each column:
>>>
>>> [image: Screenshot from 2016-02-07 15-27-51.png]
>>>
>>> and even the table border isn’t drawn correctly. I also tried another
>>> way of reading the CSV file, using sc.textFile(...).map(_.split(" "))
>>> and sqlContext.createDataFrame, and the result is the same. Can someone
>>> point out what I did wrong?
>>>
>>> —
>>> BR,
>>> Todd Leo
>>>
>>
>
