spark-user mailing list archives

From Luciano Resende <luckbr1...@gmail.com>
Subject Re: Imported CSV file content isn't identical to the original file
Date Mon, 08 Feb 2016 18:37:55 GMT
Sorry, I still get the same expected results with trunk and the Kryo serializer enabled.
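
In case it helps anyone repeat the check, here is a minimal sketch of a spark-shell session with Kryo enabled. It is not necessarily the exact session behind the result above. It assumes a Spark 1.5/1.6 spark-shell (where sqlContext is predefined), the spark-csv package pulled in with --packages, and the file path from the original message; the name df is only illustrative.

```scala
// Launch command (assumed; adjust the package version to your setup):
//   spark-shell --packages com.databricks:spark-csv_2.10:1.3.0 \
//     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer

// Confirm which serializer the running context actually uses.
println(sc.getConf.get("spark.serializer"))

// Same read as in the original message: no header row, every column kept as a string.
val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "false")
  .option("inferSchema", "false")
  .option("delimiter", " ")
  .load("hdfs:///tmp/1.csv")

// Print without truncating long cells, then dump the raw rows for comparison.
df.show(10, false)
df.collect().foreach(println)
```

Comparing the collect() output with and without the --conf flag should show whether the serializer setting alone changes the rows or only the way the table is rendered.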

On Mon, Feb 8, 2016 at 4:15 AM, SLiZn Liu <sliznmailbox@gmail.com> wrote:

> I’ve found the trigger of my issue: if I start spark-shell, or submit a
> job with spark-submit, using --conf
> spark.serializer=org.apache.spark.serializer.KryoSerializer, the
> DataFrame content comes out wrong, as I described earlier.
>
> On Mon, Feb 8, 2016 at 5:42 PM SLiZn Liu <sliznmailbox@gmail.com> wrote:
>
>> Thanks Luciano, now it looks like I’m the only one who has this issue.
>> My options are narrowed down to upgrading Spark to 1.6.0 to see if the
>> issue goes away.
>>
>> —
>> Cheers,
>> Todd Leo
>>
>>
>> On Mon, Feb 8, 2016 at 2:12 PM Luciano Resende <luckbr1975@gmail.com>
>> wrote:
>>
>>> I tried 1.5.0, 1.6.0 and 2.0.0 trunk with
>>> com.databricks:spark-csv_2.10:1.3.0 and got the expected results: the
>>> columns seem to be read properly.
>>>
>>> +----------+----------------------+
>>> |C0        |C1                    |
>>> +----------+----------------------+
>>> |1446566430|2015-11-04<SP>00:00:30|
>>> |1446566430|2015-11-04<SP>00:00:30|
>>> |1446566430|2015-11-04<SP>00:00:30|
>>> |1446566430|2015-11-04<SP>00:00:30|
>>> |1446566430|2015-11-04<SP>00:00:30|
>>> |1446566431|2015-11-04<SP>00:00:31|
>>> |1446566431|2015-11-04<SP>00:00:31|
>>> |1446566431|2015-11-04<SP>00:00:31|
>>> |1446566431|2015-11-04<SP>00:00:31|
>>> |1446566431|2015-11-04<SP>00:00:31|
>>> +----------+----------------------+
>>>
>>>
>>>
>>>
>>> On Sat, Feb 6, 2016 at 11:44 PM, SLiZn Liu <sliznmailbox@gmail.com>
>>> wrote:
>>>
>>>> Hi Spark Users Group,
>>>>
>>>> I have a CSV file to analyze with Spark, but I’m having trouble
>>>> importing it as a DataFrame.
>>>>
>>>> Here’s a minimal reproducible example. Suppose I have a
>>>> *10(rows)x2(cols)* *space-delimited CSV* file, shown below:
>>>>
>>>> 1446566430 2015-11-04<SP>00:00:30
>>>> 1446566430 2015-11-04<SP>00:00:30
>>>> 1446566430 2015-11-04<SP>00:00:30
>>>> 1446566430 2015-11-04<SP>00:00:30
>>>> 1446566430 2015-11-04<SP>00:00:30
>>>> 1446566431 2015-11-04<SP>00:00:31
>>>> 1446566431 2015-11-04<SP>00:00:31
>>>> 1446566431 2015-11-04<SP>00:00:31
>>>> 1446566431 2015-11-04<SP>00:00:31
>>>> 1446566431 2015-11-04<SP>00:00:31
>>>>
>>>> The <SP> in column 2 represents a sub-delimiter within that column. The
>>>> file is stored on HDFS; let’s say the path is hdfs:///tmp/1.csv
>>>>
>>>> I’m using *spark-csv* to import this file as Spark *DataFrame*:
>>>>
>>>> sqlContext.read.format("com.databricks.spark.csv")
>>>>         .option("header", "false") // The file has no header row
>>>>         .option("inferSchema", "false") // Do not infer types; keep every column as a string
>>>>         .option("delimiter", " ")
>>>>         .load("hdfs:///tmp/1.csv")
>>>>         .show
>>>>
>>>> Oddly, the output shows only a part of each column:
>>>>
>>>> [image: Screenshot from 2016-02-07 15-27-51.png]
>>>>
>>>> and even the border of the table wasn’t drawn correctly. I also tried
>>>> another way of reading the file, sc.textFile(...).map(_.split(" "))
>>>> followed by sqlContext.createDataFrame, and the result is the same. Can
>>>> someone point out what I did wrong?
>>>>
>>>> —
>>>> BR,
>>>> Todd Leo
>>>>
>>>
>>>
>>>
>>> --
>>> Luciano Resende
>>> http://people.apache.org/~lresende
>>> http://twitter.com/lresende1975
>>> http://lresende.blogspot.com/
>>>
>>
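
The sc.textFile(...).map(_.split(" ")) route mentioned in the original message can be sanity-checked without a cluster. The sketch below is plain Scala, not Spark: it applies the same single-space split to two of the sample rows and insists on exactly two fields per row. It assumes the <SP> sub-delimiter is not itself a space character (it is kept here as the literal placeholder from the message); SplitCheck and parse are hypothetical names used only for illustration.

```scala
object SplitCheck {
  // Two sample rows copied from the original message; "<SP>" is the
  // placeholder used there for the sub-delimiter inside column 2.
  val sample: Seq[String] = Seq(
    "1446566430 2015-11-04<SP>00:00:30",
    "1446566431 2015-11-04<SP>00:00:31"
  )

  // Split one line on the single-space column delimiter, as _.split(" ") does,
  // and fail loudly if the row does not have exactly two fields.
  def parse(line: String): (String, String) = {
    val fields = line.split(" ")
    require(fields.length == 2, s"expected 2 fields, got ${fields.length}: $line")
    (fields(0), fields(1))
  }

  def main(args: Array[String]): Unit =
    sample.map(parse).foreach { case (c0, c1) => println(s"$c0|$c1") }
}
```

If every row parses into two fields here, the delimiter handling is probably not the culprit, which is consistent with the serializer setting being the trigger.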


-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/
