spark-user mailing list archives

From Hyukjin Kwon <gurwls...@gmail.com>
Subject Re: [Spark2] Error writing "complex" type to CSV
Date Fri, 19 Aug 2016 00:56:01 GMT
Ah, BTW, there is an issue, SPARK-16216, about how dates and timestamps
are printed here, so please ignore the integer values shown for the dates.
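
For reference, the integer in the quoted output appears to be the date expressed as a count of days since 1970-01-01. A minimal sketch in plain Scala (no Spark needed) that decodes it, where the names `daysSinceEpoch` and `decoded` are only illustrative:

```scala
import java.time.LocalDate

// 7651 is the value printed for Date.valueOf("1990-12-13") in the quoted output.
val daysSinceEpoch = 7651L

// Interpret the integer as days elapsed since 1970-01-01.
val decoded = LocalDate.ofEpochDay(daysSinceEpoch)

println(decoded) // 1990-12-13
```

This only explains what the number means; it is not a statement about how SPARK-16216 was resolved.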

2016-08-19 9:54 GMT+09:00 Hyukjin Kwon <gurwls223@gmail.com>:

> Ah, sorry, I should have read this more carefully. Would you mind sharing
> your code so that I can test it?
>
> I would like to reproduce this.
>
>
> I just tested this myself but couldn't reproduce it with the code below
> (is this what you're doing?):
>
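> import java.sql.Date
> import org.apache.spark.sql.Dataset
> import spark.implicits._  // already in scope in spark-shell
>
> case class ClassData(a: String, b: Date)
>
> // Build a typed Dataset with one string field and one date field.
> val ds: Dataset[ClassData] = Seq(
>   ("a", Date.valueOf("1990-12-13")),
>   ("a", Date.valueOf("1990-12-13")),
>   ("a", Date.valueOf("1990-12-13"))
> ).toDF("a", "b").as[ClassData]
> // Write as CSV, then read it back without a schema.
> ds.write.csv("/tmp/data.csv")
> spark.read.csv("/tmp/data.csv").show()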
>
> It prints the following:
>
> +---+----+
> |_c0| _c1|
> +---+----+
> |  a|7651|
> |  a|7651|
> |  a|7651|
> +---+----+
>
>
> 2016-08-19 9:27 GMT+09:00 Efe Selcuk <efeman92@gmail.com>:
>
>> Thanks for the response. The problem with that thought is that I don't
>> think I'm dealing with a complex nested type. It's just a dataset where
>> every record is a case class with only simple types as fields, strings and
>> dates. There's no nesting.
>>
>> That's what confuses me about how it's interpreting the schema. The
>> schema seems to be one complex field rather than a bunch of simple fields.
>>
>> On Thu, Aug 18, 2016, 5:07 PM Hyukjin Kwon <gurwls223@gmail.com> wrote:
>>
>>> Hi Efe,
>>>
>>> If my understanding is correct, writing and reading complex types is
>>> not supported because the CSV format can't represent nested types.
>>>
>>> I guess the fact that the external CSV library allowed writing them
>>> was rather a bug.
>>>
>>> I think it'd be great if we could write nested data to CSV and read it
>>> back in its own format, but I guess we can't.
>>>
>>> Thanks!
>>>
>>> On 19 Aug 2016 6:33 a.m., "Efe Selcuk" <efeman92@gmail.com> wrote:
>>>
>>>> We have an application working in Spark 1.6. It uses the databricks csv
>>>> library for the output format when writing out.
>>>>
>>>> I'm attempting an upgrade to Spark 2. When writing with both the native
>>>> DataFrameWriter#csv() method and with first specifying the
>>>> "com.databricks.spark.csv" format (I suspect underlying format is the same
>>>> but I don't know how to verify), I get the following error:
>>>>
>>>> java.lang.UnsupportedOperationException: CSV data source does not
>>>> support struct<[bunch of field names and types]> data type
>>>>
>>>> There are 20 fields, mostly plain strings with a couple of dates. The
>>>> source object is a Dataset[T] where T is a case class with various fields.
>>>> The line just looks like: someDataset.write.csv(outputPath)
>>>>
>>>> Googling returned this fairly recent pull request:
>>>> https://mail-archives.apache.org/mod_mbox/spark-commits/201605.mbox/%3C65d35a72bd05483392857098a2635cc2@git.apache.org%3E
>>>>
>>>> If I'm reading that correctly, the schema shows that each record has
>>>> one field of this complex struct type? And the validation thinks it's
>>>> something that it can't serialize. I would expect the schema to have a
>>>> bunch of fields in it matching the case class, so maybe there's something
>>>> I'm misunderstanding.
>>>>
>>>> Efe
>>>>
>>>
>
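
As an illustration of the error discussed in this thread, here is a minimal sketch of a DataFrame whose only column is a struct, and of expanding that struct into top-level columns before writing CSV. It assumes a spark-shell session on Spark 2.x where `spark` is predefined. The column names (`name`, `day`, `record`) and the output path are hypothetical, and this is not necessarily the cause of the original poster's problem.

```scala
// Assumes spark-shell on Spark 2.x, where `spark` is already defined.
import org.apache.spark.sql.functions.struct
import spark.implicits._

// Hypothetical data: wrap two simple columns in a single struct column
// to mimic a schema with one complex field.
val nested = Seq(("a", "1990-12-13")).toDF("name", "day")
  .select(struct($"name", $"day").as("record"))

nested.printSchema() // one top-level column, `record`, of struct type

// Expand the struct's fields into top-level columns.
val flat = nested.select("record.*")
flat.printSchema()   // two top-level string columns: name, day

// The flat DataFrame contains only simple types, which CSV can represent.
flat.write.mode("overwrite").csv("/tmp/flat_csv_example")
```

Calling `printSchema()` on the Dataset just before the write is a quick way to check whether its fields are top-level columns or are wrapped in a single struct.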
