spark-user mailing list archives

From Hyukjin Kwon <gurwls...@gmail.com>
Subject Re: [Spark2] Error writing "complex" type to CSV
Date Mon, 22 Aug 2016 07:30:42 GMT
Whether it writes the data as garbage or as a string representation, it can't
be loaded back. So I'd say both are wrong, and bugs.

I think it'd be great if we could write CSV and read it back in its own format,
but I guess we can't for now.
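A quick way to see why the string representation can't be loaded back: a struct's own separators collide with the CSV delimiter. A minimal illustration in plain Java (the row text below is made up for illustration, not actual Spark output):

```java
// Illustrates the point above: a struct rendered as a string inside a
// CSV cell cannot be parsed back, because the struct's own commas are
// indistinguishable from the CSV column delimiter.
public class StructInCsv {
    public static void main(String[] args) {
        // A two-column row where the second column is a struct printed
        // as a string, e.g. how a case class might render: [a,1990-12-13]
        String row = "id1,[a,1990-12-13]";

        // A naive CSV reader splitting on the delimiter sees three
        // columns, not two -- the struct's comma breaks the row apart.
        String[] cols = row.split(",");
        System.out.println(cols.length); // prints 3, not the expected 2
    }
}
```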


2016-08-20 2:54 GMT+09:00 Efe Selcuk <efeman92@gmail.com>:

> Okay so this is partially PEBKAC. I just noticed that there's a debugging
> field at the end that's another case class with its own simple fields -
> *that's* the struct that was showing up in the error, not the entry
> itself.
>
> This raises a different question. What has changed that this is no longer
> possible? The pull request said that it prints garbage. Was that some
> regression in 2.0? The same code prints fine in 1.6.1. The field prints as
> an array of the values of its fields.
>
> On Thu, Aug 18, 2016 at 5:56 PM, Hyukjin Kwon <gurwls223@gmail.com> wrote:
>
>> Ah, BTW, there is an issue, SPARK-16216, about printing dates and
>> timestamps here. So please ignore the integer values for dates
>>
>> 2016-08-19 9:54 GMT+09:00 Hyukjin Kwon <gurwls223@gmail.com>:
>>
>>> Ah, sorry, I should have read this carefully. Do you mind sharing your
>>> code so I can test?
>>>
>>> I would like to reproduce this.
>>>
>>>
>>> I just tested this myself but I couldn't reproduce it, as below (is
>>> this what you're doing?):
>>>
>>> import java.sql.Date
>>> import org.apache.spark.sql.Dataset
>>> import spark.implicits._
>>>
>>> case class ClassData(a: String, b: Date)
>>>
>>> val ds: Dataset[ClassData] = Seq(
>>>   ("a", Date.valueOf("1990-12-13")),
>>>   ("a", Date.valueOf("1990-12-13")),
>>>   ("a", Date.valueOf("1990-12-13"))
>>> ).toDF("a", "b").as[ClassData]
>>> ds.write.csv("/tmp/data.csv")
>>> spark.read.csv("/tmp/data.csv").show()
>>>
>>> prints as below:
>>>
>>> +---+----+
>>> |_c0| _c1|
>>> +---+----+
>>> |  a|7651|
>>> |  a|7651|
>>> |  a|7651|
>>> +---+----+
>>>
>>>
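For reference, the 7651 values in the output above are the dates' internal representation: days since the Unix epoch (1970-01-01), which is what SPARK-16216 tracks printing properly. A small Java check (plain java.time, no Spark involved) recovers the original date:

```java
import java.time.LocalDate;

// Spark stores DateType as days since the epoch; SPARK-16216 is about
// this raw integer leaking into CSV output. Converting it back shows
// the value is the original date, just unformatted.
public class EpochDays {
    public static void main(String[] args) {
        LocalDate d = LocalDate.ofEpochDay(7651);
        System.out.println(d); // prints 1990-12-13
    }
}
```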
>>> 2016-08-19 9:27 GMT+09:00 Efe Selcuk <efeman92@gmail.com>:
>>>
>>>> Thanks for the response. The problem with that thought is that I don't
>>>> think I'm dealing with a complex nested type. It's just a dataset where
>>>> every record is a case class with only simple types as fields, strings and
>>>> dates. There's no nesting.
>>>>
>>>> That's what confuses me about how it's interpreting the schema. The
>>>> schema seems to be one complex field rather than a bunch of simple fields.
>>>>
>>>> On Thu, Aug 18, 2016, 5:07 PM Hyukjin Kwon <gurwls223@gmail.com> wrote:
>>>>
>>>>> Hi Efe,
>>>>>
>>>>> If my understanding is correct, writing and reading complex types is
>>>>> not supported because the CSV format can't represent nested types in
>>>>> its own format.
>>>>>
>>>>> I guess supporting them when writing via the external CSV library was
>>>>> rather a bug.
>>>>>
>>>>> I think it'd be great if we could write CSV and read it back in its
>>>>> own format, but I guess we can't.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On 19 Aug 2016 6:33 a.m., "Efe Selcuk" <efeman92@gmail.com> wrote:
>>>>>
>>>>>> We have an application working in Spark 1.6. It uses the databricks
>>>>>> csv library for the output format when writing out.
>>>>>>
>>>>>> I'm attempting an upgrade to Spark 2. When writing with both the
>>>>>> native DataFrameWriter#csv() method and with first specifying the
>>>>>> "com.databricks.spark.csv" format (I suspect the underlying format is
>>>>>> the same, but I don't know how to verify), I get the following error:
>>>>>>
>>>>>> java.lang.UnsupportedOperationException: CSV data source does not
>>>>>> support struct<[bunch of field names and types]> data type
>>>>>>
>>>>>> There are 20 fields, mostly plain strings with a couple of dates. The
>>>>>> source object is a Dataset[T] where T is a case class with various
>>>>>> fields. The line just looks like: someDataset.write.csv(outputPath)
>>>>>>
>>>>>> Googling returned this fairly recent pull request:
>>>>>> https://mail-archives.apache.org/mod_mbox/spark-commits/201605.mbox/%3C65d35a72bd05483392857098a2635cc2@git.apache.org%3E
>>>>>>
>>>>>> If I'm reading that correctly, the schema shows that each record has
>>>>>> one field of this complex struct type? And the validation thinks it's
>>>>>> something that it can't serialize. I would expect the schema to have a
>>>>>> bunch of fields in it matching the case class, so maybe there's
>>>>>> something I'm misunderstanding.
>>>>>>
>>>>>> Efe
>>>>>>
>>>>>
>>>
>>
>
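For anyone hitting the same error: until CSV supports complex types, the usual workaround for a schema like the one in this thread (a case class with a nested debugging field) is to flatten the nested fields into top-level scalar columns before writing; in Spark that's a select projecting the nested columns. A minimal sketch of the idea in plain Java, with all names made up for illustration:

```java
// Hypothetical sketch (not Spark API): flatten a nested record into
// top-level scalar columns so every CSV cell is a plain value and the
// row can be written and loaded back. All names are illustrative.
public class Flatten {
    static class Debug {                      // the nested "struct"
        final String stage;
        final int attempt;
        Debug(String stage, int attempt) { this.stage = stage; this.attempt = attempt; }
    }

    static class Entry {                      // the row type
        final String id;
        final Debug debug;
        Entry(String id, Debug debug) { this.id = id; this.debug = debug; }
    }

    // Promote the nested record's fields to top-level columns.
    static String toCsvRow(Entry e) {
        return String.join(",", e.id, e.debug.stage,
                           Integer.toString(e.debug.attempt));
    }

    public static void main(String[] args) {
        Entry e = new Entry("id1", new Debug("ingest", 2));
        System.out.println(toCsvRow(e)); // prints id1,ingest,2
    }
}
```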
