spark-user mailing list archives

From Gabor Somogyi <gabor.g.somo...@gmail.com>
Subject Re: to_avro and from_avro not working with struct type in spark 2.4
Date Fri, 01 Mar 2019 14:54:14 GMT
> I am thinking of writing out the dfKV dataframe to disk and then use Avro
apis to read the data.
Ping me if you have something, I'm planning similar things...
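
For anyone trying the "read it with the Avro APIs" route, here is a minimal sketch that needs only the Avro library, not Spark. It assumes that to_avro wrote the value column with the schema derived from the column's own type, i.e. with a nullable name field as shown by dfStruct.printSchema; that is an assumption about the writer, not something confirmed in this thread. The helper names (encode, decode, nullableName, plainName) are made up for illustration.

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// Assumed writer schema: "name" is nullable, matching dfStruct.printSchema.
val nullableName: Schema = new Schema.Parser().parse(
  """{"type":"record","name":"topLevelRecord","fields":[
    |{"name":"name","type":["string","null"]},
    |{"name":"age","type":"int"}]}""".stripMargin)

// Reader schema from the original post: "name" is a plain, non-nullable string.
val plainName: Schema = new Schema.Parser().parse(
  """{"type":"record","name":"topLevelRecord","fields":[
    |{"name":"name","type":"string"},
    |{"name":"age","type":"int"}]}""".stripMargin)

// Encode one record as a raw Avro binary datum (no container file header).
def encode(name: String, age: Int): Array[Byte] = {
  val record = new GenericData.Record(nullableName)
  record.put("name", name)
  record.put("age", Int.box(age))
  val out = new ByteArrayOutputStream()
  val encoder = EncoderFactory.get().binaryEncoder(out, null)
  new GenericDatumWriter[GenericRecord](nullableName).write(record, encoder)
  encoder.flush()
  out.toByteArray
}

// Decode a raw datum with the given schema used as both writer and reader schema.
def decode(bytes: Array[Byte], schema: Schema): GenericRecord = {
  val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
  new GenericDatumReader[GenericRecord](schema).read(null, decoder)
}

val bytes = encode("Mary Jane", 25)
val ok = decode(bytes, nullableName)   // name = "Mary Jane", age = 25
val wrong = decode(bytes, plainName)   // name = "", age = 9
println(ok)
println(wrong)
```

With the plain-string schema, the union branch index (a zero byte) is read as a string length of 0, and the real string length (9 for "Mary Jane") is then read as the age. That matches the [, 9] in the original report, but it does not explain why both rows show the same values. In the spark-shell, the bytes could be taken from dfKV.select("value").collect().map(_.getAs[Array[Byte]](0)) instead of encode.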


On Thu, Feb 28, 2019 at 5:27 PM Hien Luu <hienluu@gmail.com> wrote:

> Thanks for the answer.
>
> As far as the next step goes, I am thinking of writing out the dfKV
> dataframe to disk and then use Avro apis to read the data.
>
> This smells like a bug somewhere.
>
> Cheers,
>
> Hien
>
> On Thu, Feb 28, 2019 at 4:02 AM Gabor Somogyi <gabor.g.somogyi@gmail.com>
> wrote:
>
>> No, just take a look at the schema of dfStruct since you've converted its
>> value column with to_avro:
>>
>> scala> dfStruct.printSchema
>> root
>>  |-- id: integer (nullable = false)
>>  |-- name: string (nullable = true)
>>  |-- age: integer (nullable = false)
>>  |-- value: struct (nullable = false)
>>  |    |-- name: string (nullable = true)
>>  |    |-- age: integer (nullable = false)
>>
>>
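
A possible way to avoid hand-writing a schema that disagrees with the printed one is to derive the Avro schema string from the DataFrame itself. The sketch below assumes the same spark-avro 2.4 package as in the original launch command; valueField and derivedAvroSchema are illustrative names. It only addresses the nullability mismatch and is not claimed to fix the remaining wrong result discussed in this thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.functions.struct

// In spark-shell this returns the existing session; standalone it starts a local one.
val spark = SparkSession.builder().master("local[1]").appName("avro-schema-check").getOrCreate()
import spark.implicits._

val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25)).toDF("id", "name", "age")
val dfStruct = df.withColumn("value", struct("name", "age"))

// Look up the "value" column and convert its Catalyst type, keeping the
// nullability Spark inferred instead of restating it by hand.
val valueField = dfStruct.schema("value")
val derivedAvroSchema =
  SchemaConverters.toAvroType(valueField.dataType, valueField.nullable).toString

println(derivedAvroSchema)
```

The printed string can then be passed to from_avro in place of a schema built from a hand-written StructType.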
>> On Wed, Feb 27, 2019 at 6:51 PM Hien Luu <hienluu@gmail.com> wrote:
>>
>>> Thanks for looking into this.  Does this mean string fields should always
>>> be nullable?
>>>
>>> You are right that the result is not yet correct and further digging is
>>> needed :(
>>>
>>> On Wed, Feb 27, 2019 at 1:19 AM Gabor Somogyi <gabor.g.somogyi@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I was dealing with Avro lately, and most of the time the problem has
>>>> something to do with the schema.
>>>> One thing I've pinpointed quickly (and also struggled with) is that the
>>>> name field should be nullable. The result is still not correct, though, so
>>>> further digging is needed...
>>>>
>>>> scala> val expectedSchema = StructType(Seq(StructField("name",
>>>> StringType,true),StructField("age", IntegerType, false)))
>>>> expectedSchema: org.apache.spark.sql.types.StructType =
>>>> StructType(StructField(name,StringType,true),
>>>> StructField(age,IntegerType,false))
>>>>
>>>> scala> val avroTypeStruct =
>>>> SchemaConverters.toAvroType(expectedSchema).toString
>>>> avroTypeStruct: String =
>>>> {"type":"record","name":"topLevelRecord","fields":[{"name":"name","type":["string","null"]},{"name":"age","type":"int"}]}
>>>>
>>>> scala> dfKV.select(from_avro('value, avroTypeStruct)).show
>>>> +---------------------------------------------+
>>>> |from_avro(value, struct<name:string,age:int>)|
>>>> +---------------------------------------------+
>>>> |                              [Mary Jane, 25]|
>>>> |                              [Mary Jane, 25]|
>>>> +---------------------------------------------+
>>>>
>>>> BR,
>>>> G
>>>>
>>>>
>>>> On Wed, Feb 27, 2019 at 7:43 AM Hien Luu <hienluu@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I ran into a pretty weird issue with to_avro and from_avro where it was
>>>>> not able to parse the data in a struct correctly.  Please see the simple
>>>>> and self-contained example below. I am using Spark 2.4.  I am not sure if
>>>>> I missed something.
>>>>>
>>>>> This is how I start the spark-shell on my Mac:
>>>>>
>>>>> ./bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0
>>>>>
>>>>> import org.apache.spark.sql.types._
>>>>> import org.apache.spark.sql.avro._
>>>>> import org.apache.spark.sql.functions._
>>>>>
>>>>>
>>>>> spark.version
>>>>>
>>>>> val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25)).toDF("id", "name", "age")
>>>>>
>>>>> val dfStruct = df.withColumn("value", struct("name","age"))
>>>>>
>>>>> dfStruct.show
>>>>> dfStruct.printSchema
>>>>>
>>>>> val dfKV = dfStruct.select(to_avro('id).as("key"),
>>>>> to_avro('value).as("value"))
>>>>>
>>>>> val expectedSchema = StructType(Seq(StructField("name", StringType,
>>>>> false),StructField("age", IntegerType, false)))
>>>>>
>>>>> val avroTypeStruct =
>>>>> SchemaConverters.toAvroType(expectedSchema).toString
>>>>>
>>>>> val avroTypeStr = s"""
>>>>>       |{
>>>>>       |  "type": "int",
>>>>>       |  "name": "key"
>>>>>       |}
>>>>>     """.stripMargin
>>>>>
>>>>>
>>>>> dfKV.select(from_avro('key, avroTypeStr)).show
>>>>>
>>>>> // output
>>>>> +-------------------+
>>>>> |from_avro(key, int)|
>>>>> +-------------------+
>>>>> |                  1|
>>>>> |                  2|
>>>>> +-------------------+
>>>>>
>>>>> dfKV.select(from_avro('value, avroTypeStruct)).show
>>>>>
>>>>> // output
>>>>> +---------------------------------------------+
>>>>> |from_avro(value, struct<name:string,age:int>)|
>>>>> +---------------------------------------------+
>>>>> |                                        [, 9]|
>>>>> |                                        [, 9]|
>>>>> +---------------------------------------------+
>>>>>
>>>>> Please help and thanks in advance.
>>>
>>> --
>>> Regards,
>>>
>>
>
> --
> Regards,
>
