spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yash Sharma <yash...@gmail.com>
Subject Re: Spark SQL overwrite/append for partitioned tables
Date Tue, 26 Jul 2016 00:59:38 GMT
Correction -
dataDF.write.partitionBy(“year”, “month”,
“date”).mode(SaveMode.Append).text(“s3://data/test2/events/”)

On Tue, Jul 26, 2016 at 10:59 AM, Yash Sharma <yash360@gmail.com> wrote:

> Based on the behavior of spark [1], Overwrite mode will delete all your
> data when you try to overwrite a particular partition.
>
> What I did-
> - Use S3 api to delete all partitions
> - Use spark df to write in Append mode [2]
>
>
> 1.
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-deletes-all-existing-partitions-in-SaveMode-Overwrite-Expected-behavior-td18219.html
>
> 2. dataDF.write.partitionBy(“year”, “month”,
> “date”).mode(SaveMode.Overwrite).text(“s3://data/test2/events/”)
>
> On Tue, Jul 26, 2016 at 9:37 AM, Pedro Rodriguez <ski.rodriguez@gmail.com>
> wrote:
>
>> Probably should have been more specific with the code we are using, which
>> is something like
>>
>> val df = ....
>> df.write.mode("append or overwrite
>> here").partitionBy("date").saveAsTable("my_table")
>>
>> Unless there is something like what I described on the native API, I will
>> probably take the approach of having a S3 API call to wipe out that
>> partition before the job starts, but it would be nice to not have to
>> incorporate another step in the job.
>>
>> Pedro
>>
>> On Mon, Jul 25, 2016 at 5:23 PM, RK Aduri <rkaduri@collectivei.com>
>> wrote:
>>
>>> You can have a temporary file to capture the data that you would like to
>>> overwrite. And swap that with existing partition that you would want to
>>> wipe the data away. Swapping can be done by simple rename of the partition
>>> and just repair the table to pick up the new partition.
>>>
>>> Am not sure if that addresses your scenario.
>>>
>>> On Jul 25, 2016, at 4:18 PM, Pedro Rodriguez <ski.rodriguez@gmail.com>
>>> wrote:
>>>
>>> What would be the best way to accomplish the following behavior:
>>>
>>> 1. There is a table which is partitioned by date
>>> 2. Spark job runs on a particular date, we would like it to wipe out all
>>> data for that date. This is to make the job idempotent and lets us rerun a
>>> job if it failed without fear of duplicated data
>>> 3. Preserve data for all other dates
>>>
>>> I am guessing that overwrite would not work here or if it does its not
>>> guaranteed to stay that way, but am not sure. If thats the case, is there a
>>> good/robust way to get this behavior?
>>>
>>> --
>>> Pedro Rodriguez
>>> PhD Student in Distributed Machine Learning | CU Boulder
>>> UC Berkeley AMPLab Alumni
>>>
>>> ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
>>> Github: github.com/EntilZha | LinkedIn:
>>> https://www.linkedin.com/in/pedrorodriguezscience
>>>
>>>
>>>
>>> Collective[i] dramatically improves sales and marketing performance
>>> using technology, applications and a revolutionary network designed to
>>> provide next generation analytics and decision-support directly to business
>>> users. Our goal is to maximize human potential and minimize mistakes. In
>>> most cases, the results are astounding. We cannot, however, stop emails
>>> from sometimes being sent to the wrong person. If you are not the intended
>>> recipient, please notify us by replying to this email's sender and deleting
>>> it (and any attachments) permanently from your system. If you are, please
>>> respect the confidentiality of this communication's contents.
>>
>>
>>
>>
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
>>
>>
>

Mime
View raw message