spark-issues mailing list archives

From "Xin Wu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-14927) DataFrame.saveAsTable creates RDD partitions but not Hive partitions
Date Sun, 01 May 2016 03:20:12 GMT

    [ https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15265591#comment-15265591 ]

Xin Wu commented on SPARK-14927:
--------------------------------

Since Spark 2.0.0 has moved a lot of code around, including splitting HiveMetaStoreCatalog into two files, one for resolving tables and one for creating them, I tried this on Spark 2.0.0:


{code}
scala> spark.sql("create database if not exists tmp")
16/04/30 19:59:12 WARN ObjectStore: Failed to get database tmp, returning NoSuchObjectException
res23: org.apache.spark.sql.DataFrame = []

scala> df.write.partitionBy("year").mode(SaveMode.Append).saveAsTable("tmp.tmp1")
16/04/30 19:59:50 WARN CreateDataSourceTableUtils: Persisting partitioned data source relation `tmp`.`tmp1` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. Input path(s): file:/home/xwu0226/spark/spark-warehouse/tmp.db/tmp1

scala> spark.sql("select * from tmp.tmp1").show
+---+----+
|val|year|
+---+----+
|  a|2012|
+---+----+
{code}

For a datasource table created as above, Spark SQL creates the table as a Hive-managed table, but in a format that is not compatible with Hive. Spark SQL stores the partition column information (along with other metadata such as the column schema and bucket/sort columns) in serdeInfo.parameters. When the table is queried, Spark SQL resolves it and parses that information back out of serdeInfo.parameters.
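
One way to see where this metadata lands is to dump the table's metastore entry. A minimal sketch, assuming the same tmp.tmp1 table from the session above (the exact property/parameter keys vary by Spark version):

{code}
// Print the metastore entry for the table, including the Spark SQL
// specific storage properties that carry the schema and partition
// columns. show(100, false) just avoids truncating the long values.
scala> spark.sql("describe formatted tmp.tmp1").show(100, false)
{code}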


Spark 2.0.0 no longer passes this command to Hive (in fact, most DDL commands are now run natively in Spark SQL), and SHOW PARTITIONS does not yet support datasource tables; a possible workaround is sketched after the stack trace below.

{code}
scala> spark.sql("show partitions tmp.tmp1").show
org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on a datasource table: tmp.tmp1;
  at org.apache.spark.sql.execution.command.ShowPartitionsCommand.run(commands.scala:196)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:62)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:60)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:113)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:113)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:132)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:129)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:112)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:529)
  ... 48 elided
{code}
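
If you just need to see which partition values exist in the meantime, one workaround is to query the partition column itself. This is a sketch, not an equivalent of SHOW PARTITIONS, since it scans the table's data rather than reading metastore metadata:

{code}
// List the distinct values of the partition column by querying the
// table directly. Unlike SHOW PARTITIONS, this reads data instead of
// metastore metadata, so it can be slower on large tables.
scala> spark.sql("select distinct year from tmp.tmp1").show
{code}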

Hope this helps. 
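One more note on the original report: if the goal is partitions that Hive itself can see, a commonly suggested approach (along the lines of the StackOverflow answers linked below, and untested against this exact setup) is to create the partitioned table with Hive DDL first and then append with insertInto instead of saveAsTable:

{code}
// Hedged workaround sketch: create a Hive-format partitioned table up
// front, enable dynamic partitioning, then use insertInto so the
// partitions are registered in the Hive metastore.
scala> spark.sql("set hive.exec.dynamic.partition=true")
scala> spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
scala> spark.sql("create table if not exists tmp.partitiontest1 (val string) partitioned by (year int) stored as parquet")
// insertInto matches columns by position, and the partition column
// must come last, so order the DataFrame columns accordingly.
scala> Seq(("a", 2012)).toDF("val", "year").write.mode("append").insertInto("tmp.partitiontest1")
{code}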

> DataFrame.saveAsTable creates RDD partitions but not Hive partitions
> ---------------------------------------------------------------------
>
>                 Key: SPARK-14927
>                 URL: https://issues.apache.org/jira/browse/SPARK-14927
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2, 1.6.1
>         Environment: Mac OS X 10.11.4 local
>            Reporter: Sasha Ovsankin
>
> This is a followup to http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive. I tried the suggestions in the answers but couldn't make it work in Spark 1.6.1.
> I am trying to create partitions programmatically from a `DataFrame`. Here is the relevant code (adapted from a Spark test):
>     hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
>     //    hc.setConf("hive.exec.dynamic.partition", "true")
>     //    hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
>     hc.sql("create database if not exists tmp")
>     hc.sql("drop table if exists tmp.partitiontest1")
>     Seq(2012 -> "a").toDF("year", "val")
>       .write
>       .partitionBy("year")
>       .mode(SaveMode.Append)
>       .saveAsTable("tmp.partitiontest1")
>     hc.sql("show partitions tmp.partitiontest1").show
> Full file is here: https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
> I get the error that the table is not partitioned:
>     ======================
>     HIVE FAILURE OUTPUT
>     ======================
>     SET hive.support.sql11.reserved.keywords=false
>     SET hive.metastore.warehouse.dir=tmp/tests
>     OK
>     OK
>     FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a partitioned table
>     ======================
> It looks like the root cause is that `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable` always creates the table with empty partitions.
> Any help to move this forward is appreciated.


