spark-user mailing list archives

From peay <p...@protonmail.com>
Subject Saving dataframes with partitionBy: append partitions, overwrite within each
Date Fri, 29 Sep 2017 14:31:02 GMT
Hello,

I am trying to use data_frame.write.partitionBy("day").save("dataset.parquet") to write a
dataset while splitting by day.
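
With the save mode spelled out, the full call is roughly the following (PySpark; building
data_frame is elided):

    # One month of data, one subfolder per distinct value of "day":
    (data_frame
        .write
        .partitionBy("day")
        .mode("append")   # or "overwrite"; both are problematic, see below
        .save("dataset.parquet"))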

I would like to run a Spark job to process, e.g., a month:
dataset.parquet/day=2017-01-01/...
...

and then run another Spark job to add another month using the same folder structure, getting
me
dataset.parquet/day=2017-01-01/
...
dataset.parquet/day=2017-02-01/
...

However:
- with save mode "overwrite", when I process the second month, all of dataset.parquet/ gets
removed and I lose whatever was already computed for the previous month.
- with save mode "append", then I can't get idempotence: if I run the job to process a given
month twice, I'll get duplicate data in all the subfolders for that month.
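
To make the second point concrete, here is what re-running a month looks like (df_feb is a
hypothetical DataFrame holding February's data):

    # Running the same monthly job twice with "append":
    df_feb.write.partitionBy("day").mode("append").save("dataset.parquet")
    df_feb.write.partitionBy("day").mode("append").save("dataset.parquet")
    # dataset.parquet/day=2017-02-01/ etc. now contain every record twice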

Is there a way to "append" in terms of the subfolders from partitionBy, but overwrite within
each such partition?
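
The closest workaround I can think of is to loop over the days and overwrite each partition
directory directly, along these lines (days_in_month is a placeholder for the day strings
being reprocessed):

    from pyspark.sql import functions as F

    for day in days_in_month:
        (data_frame
            .filter(F.col("day") == day)
            .drop("day")   # the value is encoded in the directory name instead
            .write
            .mode("overwrite")
            .parquet("dataset.parquet/day={}".format(day)))

but that sidesteps partitionBy and hard-codes the directory layout, so any help would be
appreciated.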

Thanks!