spark-issues mailing list archives

From "Dongjoon Hyun (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-19430) Cannot read external tables with VARCHAR columns if they're backed by ORC files written by Hive 1.2.1
Date Mon, 09 Oct 2017 20:15:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-19430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-19430.
-----------------------------------
    Resolution: Duplicate

This is resolved via SPARK-19459 in Spark 2.2.
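
As a quick sanity check on the fix (a sketch, not part of the original report), the failing read from the reproduction below can be re-run on Spark 2.2 or later, which carries the SPARK-19459 change:
{code}
// Sketch only: assumes the external table "test" from the reproduction steps below
// already exists in the metastore. On Spark 2.2+ this should return the row
// instead of throwing the ClassCastException shown in the description.
spark.table("test").show()
{code}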

> Cannot read external tables with VARCHAR columns if they're backed by ORC files written by Hive 1.2.1
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19430
>                 URL: https://issues.apache.org/jira/browse/SPARK-19430
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.3, 2.0.2, 2.1.0
>            Reporter: Sameer Agarwal
>
> Spark throws an exception when trying to read external tables with VARCHAR columns if they're backed by ORC files that were written by Hive 1.2.1 (and possibly other versions of Hive).
> Steps to reproduce (credits to [~lian cheng]):
> # Write an ORC table using Hive 1.2.1 with
>    {noformat}
> CREATE TABLE orc_varchar_test STORED AS ORC
> AS SELECT CAST('a' AS VARCHAR(10)) AS c0{noformat}
> # Get the raw path of the written ORC file (one way to find it is sketched after these steps)
> # Create an external table pointing to this file and read the table using Spark
>   {noformat}
> val path = "/tmp/orc_varchar_test"
> sql(s"create external table if not exists test (c0 varchar(10)) stored as orc location
'$path'")
> spark.table("test").show(){noformat}
> The problem here is that the metadata in the ORC file written by Hive differs from that written by Spark. We can inspect the ORC file written above:
> {noformat}
> $ hive --orcfiledump file:///Users/lian/local/var/lib/hive/warehouse_1.2.1/orc_varchar_test/000000_0
> Structure for file:///Users/lian/local/var/lib/hive/warehouse_1.2.1/orc_varchar_test/000000_0
> File Version: 0.12 with HIVE_8732
> Rows: 1
> Compression: ZLIB
> Compression size: 262144
> Type: struct<_col0:varchar(10)>       <----
> ...
> {noformat}
> On the other hand, if you create the same ORC table with Spark using the same DDL and inspect the written ORC file, you'll see:
> {noformat}
> ...
> Type: struct<c0:string>
> ...
> {noformat}
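> Besides {{hive --orcfiledump}}, the two files can also be compared from the Spark shell by reading each one with Spark's native ORC datasource and printing the inferred schema (a sketch with placeholder paths, not from the original report):
> {code}
> // Sketch only: the paths stand in for the Hive-written and Spark-written files above.
> // This goes through Spark's native ORC reader, so it only inspects the file metadata;
> // it does not exercise the Hive SerDe path that triggers the ClassCastException.
> spark.read.orc("file:///path/to/hive_written/000000_0").printSchema()
> spark.read.orc("file:///path/to/spark_written/part-00000").printSchema()
> {code}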
> Note that all tests are done with {{spark.sql.hive.convertMetastoreOrc}} set to {{false}}, which is the default for these versions.
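> The flag can be checked or changed at runtime (a sketch, not from the original report; whether flipping it avoids the error is not claimed here):
> {code}
> // "false" (the default here) means ORC metastore tables go through the Hive SerDe path.
> spark.conf.get("spark.sql.hive.convertMetastoreOrc")
> // "true" makes Spark use its native ORC reader for metastore ORC tables instead.
> spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
> {code}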
> I've verified that Spark 1.6.x, 2.0.x and 2.1.x all fail with instances of the following
error:
> {code}
> java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to org.apache.hadoop.io.Text
>     at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
>     at org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:529)
>     at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
>     at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
>     at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
>     at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>     at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>     at org.apache.spark.scheduler.Task.run(Task.scala:99)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
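> The trace points to a writable-type mismatch: Spark's {{WritableStringObjectInspector}} expects a {{Text}} value (presumably because the {{varchar(10)}} column is mapped to {{string}} on the Spark side), while the Hive-written ORC file hands back a {{HiveVarcharWritable}}. An illustrative cast (not from the report) fails the same way:
> {code}
> // Illustrative sketch only, assuming the Hive serde classes are on the classpath.
> import org.apache.hadoop.hive.serde2.io.HiveVarcharWritable
> import org.apache.hadoop.io.Text
> val value: AnyRef = new HiveVarcharWritable()
> value.asInstanceOf[Text]   // throws java.lang.ClassCastException, as in the trace above
> {code}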



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


