spark-user mailing list archives

From Koert Kuipers <ko...@tresata.com>
Subject Re: Saving dataframes with partitionBy: append partitions, overwrite within each
Date Wed, 01 Aug 2018 23:18:20 GMT
This works for DataFrames with Spark 2.3 by changing a global setting, and
will be configurable per write in 2.4.
see:
https://issues.apache.org/jira/browse/SPARK-20236
https://issues.apache.org/jira/browse/SPARK-24860

On Wed, Aug 1, 2018 at 3:11 PM, Nirav Patel <npatel@xactlycorp.com> wrote:

> Hi Peay,
>
> Have you found a better solution yet? I am having the same issue.
>
> The following post says it works from Spark 2.1 onward, but only when you
> use sqlContext and not the DataFrame writer:
> https://medium.com/@anuvrat/writing-into-dynamic-partitions-using-spark-
> 2e2b818a007a
>
> Thanks,
> Nirav
>
> On Mon, Oct 2, 2017 at 4:37 AM, Pavel Knoblokh <knoblokh@gmail.com> wrote:
>
>> If your processing task inherently processes input data by month, you
>> may want to "manually" partition the output data by month as well as
>> by day, that is, to save it under a path that includes the given month,
>> e.g. "dataset.parquet/month=01". Then you will be able to use the
>> overwrite mode on each month partition. Hope this could be of some
>> help.
>>
>> --
>> Pavel Knoblokh
>>
>> On Fri, Sep 29, 2017 at 5:31 PM, peay <peay@protonmail.com> wrote:
>> > Hello,
>> >
>> > I am trying to use
>> > data_frame.write.partitionBy("day").save("dataset.parquet") to write a
>> > dataset while splitting by day.
>> >
>> > I would like to run a Spark job to process, e.g., a month:
>> > dataset.parquet/day=2017-01-01/...
>> > ...
>> >
>> > and then run another Spark job to add another month using the same
>> folder
>> > structure, getting me
>> > dataset.parquet/day=2017-01-01/
>> > ...
>> > dataset.parquet/day=2017-02-01/
>> > ...
>> >
>> > However:
>> > - with save mode "overwrite", when I process the second month, all of
>> > dataset.parquet/ gets removed and I lose whatever was already computed
>> for
>> > the previous month.
>> > - with save mode "append", then I can't get idempotence: if I run the
>> job to
>> > process a given month twice, I'll get duplicate data in all the
>> subfolders
>> > for that month.
>> >
>> > Is there a way to do "append" in terms of the subfolders from
>> partitionBy,
>> > but overwrite within each such partition? Any help would be
>> appreciated.
>> >
>> > Thanks!
>>
>>
>>