spark-user mailing list archives

From Pedro Rodriguez <ski.rodrig...@gmail.com>
Subject Re: Spark SQL overwrite/append for partitioned tables
Date Mon, 25 Jul 2016 23:37:24 GMT
I probably should have been more specific about the code we are using, which
is something like:

val df = ....
df.write.mode("append or overwrite here").partitionBy("date").saveAsTable("my_table")

Unless the native API offers something like what I described, I will
probably make an S3 API call to wipe out that partition before the job
starts, but it would be nice not to have to incorporate another step into
the job.
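
For reference, a minimal sketch of that "wipe the partition, then append" step, done through the Hadoop FileSystem API rather than a raw S3 call. It assumes Spark 2.x (SparkSession), a Hive-style directory layout (date=<value>) and a path-based Parquet write instead of saveAsTable. The object name, helper names and tableRoot are hypothetical, and a metastore-backed table may also need its partitions repaired afterwards.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object IdempotentPartitionWrite {
  // Hive-style partition directory, e.g. s3a://bucket/my_table/date=2016-07-25
  // (tableRoot is a hypothetical location; the layout is an assumption).
  def partitionDir(tableRoot: String, column: String, value: String): String =
    s"${tableRoot.stripSuffix("/")}/$column=$value"

  // Delete the directory for one date, then append that date's rows.
  // Note: the delete and the append are two separate steps, not one atomic operation.
  def rewritePartition(spark: SparkSession, df: DataFrame,
                       tableRoot: String, date: String): Unit = {
    val target = new Path(partitionDir(tableRoot, "date", date))
    val fs = target.getFileSystem(spark.sparkContext.hadoopConfiguration)

    // Remove only this date's directory; all other partitions stay untouched.
    if (fs.exists(target)) fs.delete(target, true)

    // Append the rows for this date; partitionBy recreates date=<value>.
    df.filter(df("date") === date)
      .write
      .mode(SaveMode.Append)
      .partitionBy("date")
      .parquet(tableRoot)
  }
}
```

Rerunning rewritePartition for the same date should leave a single copy of that date's data, since the old directory is removed before the new rows are appended.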

Pedro

On Mon, Jul 25, 2016 at 5:23 PM, RK Aduri <rkaduri@collectivei.com> wrote:

> You can use a temporary file to capture the data that you would like to
> overwrite, and then swap it with the existing partition whose data you want
> to wipe. The swap can be done with a simple rename of the partition,
> followed by a repair of the table so that it picks up the new partition.
>
> I am not sure if that addresses your scenario.
>
> On Jul 25, 2016, at 4:18 PM, Pedro Rodriguez <ski.rodriguez@gmail.com>
> wrote:
>
> What would be the best way to accomplish the following behavior:
>
> 1. There is a table which is partitioned by date
> 2. When the Spark job runs for a particular date, we would like it to wipe
> out all data for that date. This makes the job idempotent and lets us rerun
> a failed job without fear of duplicated data
> 3. Preserve data for all other dates
>
> I am guessing that overwrite would not work here, or if it does, it's not
> guaranteed to stay that way, but I am not sure. If that's the case, is there a
> good/robust way to get this behavior?
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience




