spark-user mailing list archives

From SLiZn Liu <sliznmail...@gmail.com>
Subject Re: Imported CSV file content isn't identical to the original file
Date Sun, 07 Feb 2016 16:22:53 GMT
*Update*: in local mode (spark-shell --master local[2]), it works well
whether I read from the local file system or HDFS. But that doesn't solve
the issue, since my data scale requires hundreds of CPU cores and hundreds
of GB of RAM.

BTW, it's Chinese New Year now; I wish you all a happy new year and great
fortune in the Year of the Monkey!

—
BR,
Todd Leo

On Sun, Feb 7, 2016 at 6:09 PM SLiZn Liu <sliznmailbox@gmail.com> wrote:

> Hi Igor,
>
> In my case, it's not a matter of *truncate*. As the Spark API doc for the
> show() function reads,
>
> truncate: Whether truncate long strings. If true, strings more than 20
> characters will be truncated and all cells will be aligned right…
>
> whereas in my case it is the leading characters of my two columns that
> are missing.
>
> Still, it's good to know how to show the whole content of a cell.
>
> —
> BR,
> Todd Leo
>
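
To make the distinction in the reply above concrete, here is a minimal sketch in plain Scala (no Spark needed) of the truncation rule quoted from the show() doc. The helper name truncateCell is hypothetical. The detail that a cut cell keeps its first 17 characters followed by "..." is an assumption about how Spark 1.5 shortens a cell; the quoted doc only says that strings longer than 20 characters are truncated.

```scala
// Hypothetical helper, not part of the Spark API: mimics the truncation
// rule described in the show() doc for a single cell.
// Assumption: a cut cell keeps its first 17 characters and ends in "...".
def truncateCell(cell: String, truncate: Boolean = true): String =
  if (truncate && cell.length > 20) cell.take(17) + "..." else cell

// Sample values copied from the original post; "<SP>" is kept as the
// literal placeholder used there.
val col1 = "1446566430"
val col2 = "2015-11-04<SP>00:00:30"

// A short cell passes through unchanged; a long cell loses its tail,
// never its leading characters.
println(truncateCell(col1))
println(truncateCell(col2))
println(truncateCell(col2, truncate = false))
```

Under this rule truncation can only remove the end of a cell, so it would not explain missing leading characters. On a real DataFrame df, the call that switches truncation off is df.show(false).
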
> On Sun, Feb 7, 2016 at 5:42 PM Igor Berman <igor.berman@gmail.com> wrote:
>
>> show has a truncate argument;
>> pass false so it won't truncate your results
>>
>> On 7 February 2016 at 11:01, SLiZn Liu <sliznmailbox@gmail.com> wrote:
>>
>>> Also, I'm using *Spark 1.5.2* with *spark-csv 1.3.0*. I also tried
>>> HiveContext, but the result is exactly the same.
>>>
>>> On Sun, Feb 7, 2016 at 3:44 PM SLiZn Liu <sliznmailbox@gmail.com> wrote:
>>>
>>>> Hi Spark Users Group,
>>>>
>>>> I have a CSV file to analyze with Spark, but I'm having trouble
>>>> importing it as a DataFrame.
>>>>
>>>> Here's a minimal reproducible example. Suppose I have a
>>>> *10 (rows) x 2 (cols)* *space-delimited CSV* file, shown below:
>>>>
>>>> 1446566430 2015-11-04<SP>00:00:30
>>>> 1446566430 2015-11-04<SP>00:00:30
>>>> 1446566430 2015-11-04<SP>00:00:30
>>>> 1446566430 2015-11-04<SP>00:00:30
>>>> 1446566430 2015-11-04<SP>00:00:30
>>>> 1446566431 2015-11-04<SP>00:00:31
>>>> 1446566431 2015-11-04<SP>00:00:31
>>>> 1446566431 2015-11-04<SP>00:00:31
>>>> 1446566431 2015-11-04<SP>00:00:31
>>>> 1446566431 2015-11-04<SP>00:00:31
>>>>
>>>> The <SP> in column 2 represents a sub-delimiter within that column.
>>>> The file is stored on HDFS; let's say the path is hdfs:///tmp/1.csv
>>>>
>>>> I'm using *spark-csv* to import this file as a Spark *DataFrame*:
>>>>
>>>> sqlContext.read.format("com.databricks.spark.csv")
>>>>         .option("header", "false") // No header row: the first line is data
>>>>         .option("inferSchema", "false") // Do not infer types: columns stay strings
>>>>         .option("delimiter", " ") // Fields are separated by a single space
>>>>         .load("hdfs:///tmp/1.csv")
>>>>         .show
>>>>
>>>> Oddly, the output shows only a part of each column:
>>>>
>>>> [image: Screenshot from 2016-02-07 15-27-51.png]
>>>>
>>>> Even the border of the table wasn't drawn correctly. I also tried
>>>> another way to read the CSV file, using
>>>> sc.textFile(...).map(_.split(" ")) and sqlContext.createDataFrame,
>>>> and the result was the same. Can someone point out what I did wrong?
>>>>
>>>> —
>>>> BR,
>>>> Todd Leo
>>>>
>>>
>>
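
For reference, the second read path mentioned in the original post (sc.textFile, split, then sqlContext.createDataFrame) might look roughly like the sketch below, meant for pasting into spark-shell on Spark 1.5.x. The names splitFields and readSpaceDelimited and the column names C0 and C1 are hypothetical; the thread does not show the actual code, nor does it say which character <SP> stands for.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical helper: split one line on single spaces, as in the post.
// Note: if <SP> were itself a space, this would yield three fields, not two.
def splitFields(line: String): Seq[String] = line.split(" ").toSeq

// Hypothetical helper: read a two-column, space-delimited text file into
// a DataFrame of strings. The column names C0 and C1 are placeholders.
def readSpaceDelimited(sc: SparkContext,
                       sqlContext: SQLContext,
                       path: String): DataFrame = {
  val schema = StructType(Seq(
    StructField("C0", StringType, nullable = true),
    StructField("C1", StringType, nullable = true)))
  // One Row per input line, each field kept as a raw string.
  val rows = sc.textFile(path).map(line => Row.fromSeq(splitFields(line)))
  sqlContext.createDataFrame(rows, schema)
}
```

In spark-shell, where sc and sqlContext are predefined, this could be used as readSpaceDelimited(sc, sqlContext, "hdfs:///tmp/1.csv").show(false). Keeping every field as a string takes type inference out of the picture when comparing against the spark-csv result.
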
