spark-issues mailing list archives

From "Dipankar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-15682) Hive ORC partition write looks for root hdfs folder for existence
Date Tue, 31 May 2016 21:39:12 GMT

    [ https://issues.apache.org/jira/browse/SPARK-15682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308679#comment-15308679 ]

Dipankar commented on SPARK-15682:
----------------------------------

I could make this work by setting the save mode to append:
result_partition.write.format("orc").partitionBy("proc_date").mode("append").save("test.sms_outbound_view_orc")

It looks like, since the partition column value is obtained from the DataFrame, there is no
way to determine the partition value statically.
Hence, the root folder is checked for existence. With append mode, this check is skipped and
the write appends to an existing partition or creates a new one, as the case may be.
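
To illustrate what I think is happening (a rough sketch of the behaviour, not Spark's actual source): with the default ErrorIfExists save mode, the only thing that can be checked up front is the root output path, because the concrete proc_date=<value> folders are only known after the DataFrame is scanned.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative only -- mirrors the behaviour described above, not Spark's code.
def checkOutputPath(rootPath: String, appendMode: Boolean): Unit = {
  val fs = FileSystem.get(new Configuration())
  // The partition subfolders (proc_date=...) are not known statically,
  // so only the root path can be tested for existence.
  if (!appendMode && fs.exists(new Path(rootPath)))
    sys.error(s"path $rootPath already exists.")
}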

However, it DOES NOT update the Hive metastore with the new partition information!
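
As a stopgap, the new partition can be added to the metastore by hand through the HiveContext. A sketch only: the table name below is hypothetical, and the location comes from the path in the stack trace.

// Sketch: register the freshly written partition in the Hive metastore by hand.
// "sms_outbound_view_orc" is a hypothetical table name -- substitute the real one.
val procDate = "2016-05-31"
sqlContext.sql(
  s"ALTER TABLE sms_outbound_view_orc ADD IF NOT EXISTS PARTITION (proc_date='$procDate') " +
  s"LOCATION 'hdfs://hdpprod/user/dipankar.ghosal/test.sms_outbound_view_orc/proc_date=$procDate'")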

> Hive ORC partition write looks for root hdfs folder for existence
> -----------------------------------------------------------------
>
>                 Key: SPARK-15682
>                 URL: https://issues.apache.org/jira/browse/SPARK-15682
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.1
>            Reporter: Dipankar
>
> Scenario:
> I am using the program below to create a new partition based on the current date, which signifies the run date.
> However, it fails, reporting that the HDFS folder already exists. The check is against the root folder, not the new partition value.
> Is the partitionBy clause actually not checking the Hive metastore, or the folders down to proc_date=<some value>? Is it just a way to create folders based on the partition key, unrelated to Hive partitioning?
> Alternatively, should I use
> result.write.format("orc").save("test.sms_outbound_view_orc/proc_date=2016-05-30")
> to achieve the same result? But this will not update the Hive metastore with the new partition details.
> Is Spark's ORC support not equivalent to the HCatStorer API?
> My Hive table is built with proc_date as the partition column.
> Source code:
> result.registerTempTable("result_tab")
> val result_partition = sqlContext.sql("FROM result_tab select *,'"+curr_date+"' as proc_date")
> result_partition.write.format("orc").partitionBy("proc_date").save("test.sms_outbound_view_orc")
> Exception
> 16/05/31 15:57:34 INFO ParseDriver: Parsing command: FROM result_tab select *,'2016-05-31' as proc_date
> 16/05/31 15:57:34 INFO ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: path hdfs://hdpprod/user/dipankar.ghosal/test.sms_outbound_view_orc already exists.;
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76)
> 	at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
> 	at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
> 	at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> 	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> 	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> 	at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
> 	at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
> 	at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
> 	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
> 	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
> 	at SampleApp$.main(SampleApp.scala:31)
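
Regarding the HCatStorer question above: a metastore-aware route worth trying is DataFrameWriter.insertInto, which targets an existing Hive table rather than a raw path. A hedged sketch, assuming sqlContext is a HiveContext, dynamic partitioning is allowed, and the partitioned ORC table already exists (the table name is hypothetical):

// Sketch: write into an existing partitioned Hive table so the metastore
// tracks the partitions. "sms_outbound_view_orc" is a hypothetical name.
sqlContext.sql("SET hive.exec.dynamic.partition=true")
sqlContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
// proc_date must be the last column so it maps to the partition column.
result_partition.write.mode("append").insertInto("sms_outbound_view_orc")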



