spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nirav Patel <npa...@xactlycorp.com>
Subject Overwrite only specific partition with hive dynamic partitioning
Date Wed, 01 Aug 2018 19:24:25 GMT
Hi,

I have a hive partition table created using sparkSession. I would like to
insert/overwrite Dataframe data to specific set of partition without
loosing any other partition. In each run I have to update Set of partitions
not just one.

e.g. I have dataframe with bid=1, bid=2, bid=3 in first time and I can
write it  by using

`df.write.mode(SaveMode.Overwrite).partitionBy("bid").parquet(TableBase
Location)`


It generates dirs: bid=1, bid=2, bid=3  inside TableBaseLocation

But next time when I have a dataframe with  bid=1, bid=4 and use same code
above it removes bid=2 and bid=3. in other words I dont get idempotency.

I tried SaveMode.append but that creates duplicate data inside "bid=1"


I read
https://issues.apache.org/jira/browse/SPARK-18183

With that approach it seems like I may have to updated multiple partition
manually for each input partition. That seems like lot of work on every
update. Is there a better way for this?

Can this fix be apply to dataframe based approach as well?

Thanks

-- 


 <http://www.xactlycorp.com/email-click/>

 
<https://www.instagram.com/xactlycorp/>   
<https://www.linkedin.com/company/xactly-corporation>   
<https://twitter.com/Xactly>   <https://www.facebook.com/XactlyCorp>   
<http://www.youtube.com/xactlycorporation>

Mime
View raw message