spark-issues mailing list archives

From "Josh Rosen (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-14387) Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc
Date Fri, 23 Sep 2016 17:49:20 GMT

     [ https://issues.apache.org/jira/browse/SPARK-14387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-14387:
-------------------------------
    Target Version/s: 2.0.2  (was: 2.0.1)

> Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc
> -------------------------------------------------------------------------
>
>                 Key: SPARK-14387
>                 URL: https://issues.apache.org/jira/browse/SPARK-14387
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Rajesh Balamohan
>
> In the master branch, I tried to run TPC-DS queries (e.g., Query27) at 200 GB scale. Initially I got the following exception (FileScanRDD has been made the default in the master branch):
> {noformat}
> 16/04/04 06:49:55 WARN TaskSetManager: Lost task 0.0 in stage 15.0..... java.lang.IllegalArgumentException: Field "s_store_sk" does not exist.
> at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
> at scala.collection.AbstractMap.getOrElse(Map.scala:59)
> at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235)
> at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
> at org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410)
> at org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:157)
> at org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:146)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:69)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:60)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at org.apache.spark.sql.execution.WholeStageCodegen$$anonfun$6$$anon$1.hasNext(WholeStageCodegen.scala:361)
> {noformat}
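> The failure mode can be sketched outside Spark's internals (illustrative only; the two-column schema below is a hypothetical stand-in): Hive 1.x writes ORC file footers with positional column names, so when setRequiredColumns resolves a metastore column name via StructType.fieldIndex it throws exactly the error above.
> {noformat}
> import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
>
> // Physical schema as recorded in a Hive-1.x ORC file footer: positional names only.
> val physicalSchema = StructType(Seq(
>   StructField("_col0", IntegerType),
>   StructField("_col1", IntegerType)))
>
> // Spark looks up the logical (metastore) name, which the footer lacks:
> physicalSchema.fieldIndex("s_store_sk")
> // => java.lang.IllegalArgumentException: Field "s_store_sk" does not exist.
> {noformat}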
> When running with "spark.sql.sources.fileScan=false", the following exception is thrown:
> {noformat}
> 16/04/04 09:02:00 ERROR SparkExecuteStatementOperation: Error executing query, currentState RUNNING,
> java.lang.IllegalArgumentException: Field "cd_demo_sk" does not exist.
>         at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
>         at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
>         at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>         at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>         at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235)
>         at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
>         at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>         at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>         at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
>         at org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410)
>         at org.apache.spark.sql.hive.orc.OrcTableScan.execute(OrcRelation.scala:317)
>         at org.apache.spark.sql.hive.orc.DefaultSource.buildInternalScan(OrcRelation.scala:124)
>         at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$16.apply(DataSourceStrategy.scala:229)
>         at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$16.apply(DataSourceStrategy.scala:228)
>         at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:537)
>         at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:536)
>         at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:625)
>         at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:532)
>         at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:224)
>         at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>         at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>         at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>         at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>         at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>         at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>         at org.apache.spark.sql.execution.SparkStrategies$EquiJoinSelection$.apply(SparkStrategies.scala:147)
>         at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>         at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>         at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>         at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> {noformat}
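> Until the root cause (detailed below) is fixed, a possible stopgap, untested here at this scale, is to disable the metastore ORC conversion so reads go through the Hive SerDe path instead of OrcRelation:
> {noformat}
> // Hedged workaround sketch: the Hive SerDe path maps columns by position
> // using the metastore schema rather than the file footer names.
> spark.sql("SET spark.sql.hive.convertMetastoreOrc=false")
> {noformat}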
> The TPC-DS dataset generator produces ORC files whose internal column names differ from the table's columns (positional _colN names, as shown below) and maintains the real-name mapping in the Hive metastore. That mapping is somehow broken in master, causing these exceptions.
> e.g.:
> {noformat}
> Structure for /apps/hive/warehouse/tpcds_bin_partitioned_orc_200.db/catalog_returns/cr_returned_date_sk=2451916/000019_0
> Type: struct<_col0:int,_col1:int,_col2:int,_col3:int,_col4:int,_col5:int,_col6:int,_col7:int,_col8:int,_col9:int,_col10:int,_col11:int,_col12:int,_col13:int,_col14:int,_col15:bigint,_col16:int,_col17:float,_col18:float,_col19:float,_col20:float,_col21:float,_col22:float,_col23:float,_col24:float,_col25:float>
> {noformat}
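> A fix presumably has to bridge the two schemas by ordinal rather than by name; a rough illustrative sketch (names here are made up, not Spark's internals):
> {noformat}
> // Map a requested logical column to its positional name in the file by
> // finding its ordinal in the metastore (table) schema.
> def toPhysicalName(metastoreColumns: Seq[String], requested: String): String = {
>   val ordinal = metastoreColumns.indexOf(requested)
>   require(ordinal >= 0, s"""Field "$requested" does not exist.""")
>   s"_col$ordinal" // Hive 1.x ORC writes columns as _col0, _col1, ...
> }
>
> // e.g., toPhysicalName(Seq("cr_returned_time_sk", "cr_item_sk"), "cr_item_sk") == "_col1"
> {noformat}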
>  Creating this ticket because this used to work in earlier branches.



