spark-user mailing list archives

From Nirav Patel <npa...@xactlycorp.com>
Subject Re: Saving dataframes with partitionBy: append partitions, overwrite within each
Date Wed, 01 Aug 2018 19:11:21 GMT
Hi Peay,

Have you found a better solution yet? I am having the same issue.

The following post says this works from Spark 2.1 onward, but only through
sqlContext (the SQL interface), not the DataFrame writer:
https://medium.com/@anuvrat/writing-into-dynamic-partitions-using-spark-2e2b818a007a

Thanks,
Nirav

On Mon, Oct 2, 2017 at 4:37 AM, Pavel Knoblokh <knoblokh@gmail.com> wrote:

> If your processing task inherently processes input data by month you
> may want to "manually" partition the output data by month as well as
> by day, that is to save it with a file name including the given month,
> i.e. "dataset.parquet/month=01". Then you will be able to use the
> overwrite mode with each month partition. Hope this could be of some
> help.
>
> --
> Pavel Knoblokh
>
> On Fri, Sep 29, 2017 at 5:31 PM, peay <peay@protonmail.com> wrote:
> > Hello,
> >
> > I am trying to use
> > data_frame.write.partitionBy("day").save("dataset.parquet") to write a
> > dataset while splitting by day.
> >
> > I would like to run a Spark job to process, e.g., a month:
> > dataset.parquet/day=2017-01-01/...
> > ...
> >
> > and then run another Spark job to add another month using the same folder
> > structure, getting me
> > dataset.parquet/day=2017-01-01/
> > ...
> > dataset.parquet/day=2017-02-01/
> > ...
> >
> > However:
> > - with save mode "overwrite", when I process the second month, all of
> > dataset.parquet/ gets removed and I lose whatever was already computed
> > for the previous month.
> > - with save mode "append", I can't get idempotence: if I run the job
> > to process a given month twice, I'll get duplicate data in all the
> > subfolders for that month.
> >
> > Is there a way to "append" in terms of the subfolders from partitionBy,
> > but overwrite within each such partition? Any help would be appreciated.
> >
> > Thanks!
>

