spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dongjoon Hyun (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-26836) Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
Date Fri, 08 Feb 2019 08:09:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-26836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763393#comment-16763393
] 

Dongjoon Hyun commented on SPARK-26836:
---------------------------------------

Thank you for reporting, [~treff7es]. 
Did you see this issue before at Spark 2.3.x with Databricks Avro libraries?

cc [~Gengliang.Wang]

> Columns get switched in Spark SQL using Avro backed Hive table if schema evolves
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-26836
>                 URL: https://issues.apache.org/jira/browse/SPARK-26836
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>         Environment: I tested with Hive and HCatalog version of 2.3.4 on Spark 2.4
>            Reporter: Tamas Nemeth
>            Priority: Major
>         Attachments: doctors.avro, doctors_evolved.avro, doctors_evolved.json, original.avsc
>
>
> I have a hive avro table where the avro schema is stored on s3 next to the avro files. 
> In the table definiton the avro.schema.url always points to the latest partition's _schema.avsc
file which is always the lates schema. (Avro schemas are backward and forward compatible in
a table)
> When new data comes in, I always add a new partition where the avro.schema.url properties
also set to the _schema.avsc which was used when it was added and of course I always update
the table avro.schema.url property to the latest one.
> Querying this table works fine until the schema evolves in a way that a new optional
property is added in the middle. 
> When this happens then after the spark sql query the columns in the old partition gets
mixed up and it shows the wrong data for the columns.
> If I query the table with Hive then everything is perfectly fine and it gives me back
the correct columns for the partitions which were created the old schema and for the new which
was created the evolved schema.
>  
> Here is how I could reproduce with the [doctors.avro|https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/doctors.avro]
example data in sql test suite.
>  # I have created two partition folder:
> {code:java}
> [hadoop@ip-192-168-10-158 hadoop]$ hdfs dfs -ls s3://somelocation/doctors/*/
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 418 2019-02-06 12:48 s3://somelocation/doctors
> /dt=2019-02-05/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 521 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-05/doctors.avro
> Found 2 items
> -rw-rw-rw- 1 hadoop hadoop 580 2019-02-06 12:49 s3://somelocation/doctors
> /dt=2019-02-06/_schema.avsc
> -rw-rw-rw- 1 hadoop hadoop 577 2019-02-06 12:13 s3://somelocation/doctors
> /dt=2019-02-06/doctors_evolved.avro{code}
> Here the first partition had data which was created with the schema before evolving and
the second one had the evolved one. (the evolved schema is the same as in your testcase except I
moved the extra_field column to the last from the second and I generated two lines of avro data
with the evolved schema.
>  # I have created a hive table with the following command:
>  
> {code:java}
> CREATE EXTERNAL TABLE `default.doctors`
>  PARTITIONED BY (
>  `dt` string
>  )
>  ROW FORMAT SERDE
>  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>  WITH SERDEPROPERTIES (
>  'avro.schema.url'='s3://somelocation/doctors/
> /dt=2019-02-06/_schema.avsc')
>  STORED AS INPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>  OUTPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>  LOCATION
>  's3://somelocation/doctors/'
>  TBLPROPERTIES (
>  'transient_lastDdlTime'='1538130975'){code}
>  
> Here as you can see the table schema url points to the latest schema
> 3. I ran an msck _repair table_ to pick up all the partitions.
> Fyi: If I run my select * query from here then everything is fine and no columns switch
happening.
> 4. Then I changed the first partition's avro.schema.url url to points to the schema
which is under the partition folder (non-evolved one -> s3://somelocation/doctors/
> /dt=2019-02-05/_schema.avsc)
> Then if you ran a _select * from default.spark_test_ then the columns will be mixed up
(on the data below the first name column becomes the extra_field column. I guess because in
the latest schema it is the second column):
>  
> {code:java}
> number,extra_field,first_name,last_name,dt 
> 6,Colin,Baker,null,2019-02-05 
> 3,Jon,Pertwee,null,2019-02-05 
> 4,Tom,Baker,null,2019-02-05 
> 5,Peter,Davison,null,2019-02-05 
> 11,Matt,Smith,null,2019-02-05 
> 1,William,Hartnell,null,2019-02-05 
> 7,Sylvester,McCoy,null,2019-02-05 
> 8,Paul,McGann,null,2019-02-05 
> 2,Patrick,Troughton,null,2019-02-05 
> 9,Christopher,Eccleston,null,2019-02-05 
> 10,David,Tennant,null,2019-02-05 
> 21,fishfinger,Jim,Baker,2019-02-06 
> 24,fishfinger,Bean,Pertwee,2019-02-06
> {code}
> If I try the same query from Hive and not from spark sql then everything is fine and
it never switches the columns.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message