spark-user mailing list archives

From Daniel Siegmann <daniel.siegmann@velos.io>
Subject Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file
Date Thu, 12 Jun 2014 18:39:45 GMT
The old behavior (A) was dangerous, so it's good that (B) is now the
default. But in some cases I really do want to replace the old data, as per
(C). For example, I may rerun a previous computation (perhaps the input
data was corrupt and I'm rerunning with good input).

Currently I have to write separate code to remove the files before calling
Spark. It would be very convenient if Spark could do this for me. Has
anyone created a JIRA issue to support (C)?
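
For reference, the "separate code" I mean is just a few Hadoop FileSystem
calls before the save. A rough sketch (the path is hypothetical, and I'm
assuming the output lives on the file system behind sc.hadoopConfiguration):

  import org.apache.hadoop.fs.{FileSystem, Path}

  val outputPath = new Path("/data/results")  // hypothetical output dir
  val fs = FileSystem.get(sc.hadoopConfiguration)
  if (fs.exists(outputPath)) {
    fs.delete(outputPath, true)  // "true" = recursive delete of old part files
  }
  rdd.saveAsTextFile(outputPath.toString)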


On Mon, Jun 9, 2014 at 3:02 AM, Aaron Davidson <ilikerps@gmail.com> wrote:

> It is not a very good idea to save the results in the exact same place as
> the data. Any failures during the job could lead to corrupted data, because
> recomputing the lost partitions would involve reading the original
> (now-nonexistent) data.
>
> As such, the only "safe" way to do this would be to do as you said, and
> only delete the input data once the entire output has been successfully
> created.
>
>
> On Sun, Jun 8, 2014 at 10:32 PM, innowireless TaeYun Kim <
> taeyun.kim@innowireless.co.kr> wrote:
>
>> Without (C), what is the best practice to implement the following
>> scenario?
>>
>> 1. rdd = sc.textFile(FileA)
>> 2. rdd = rdd.map(...)  // actually modifying the rdd
>> 3. rdd.saveAsTextFile(FileA)
>>
>> Since RDD transformations are lazy, the RDD will not materialize until
>> saveAsTextFile() runs, so FileA must still exist at that point, yet it
>> must be deleted before saveAsTextFile() can write to it.
>>
>> What I can think of is the following (sketched in code below):
>>
>> 3. rdd.saveAsTextFile(TempFile)
>> 4. delete FileA
>> 5. rename TempFile to FileA
>>
>> This is not very convenient...
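>>
>> Spelled out, steps 3-5 are something like this (a rough, untested sketch
>> using the Hadoop FileSystem API; paths are hypothetical):
>>
>>   import org.apache.hadoop.fs.{FileSystem, Path}
>>
>>   val fileA = new Path("/data/FileA")
>>   val temp  = new Path("/data/FileA_tmp")
>>
>>   rdd.saveAsTextFile(temp.toString)               // 3. save to temp dir
>>   val fs = FileSystem.get(sc.hadoopConfiguration)
>>   fs.delete(fileA, true)                          // 4. delete FileA
>>   fs.rename(temp, fileA)                          // 5. rename temp to FileA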
>>
>> Thanks.
>>
>> -----Original Message-----
>> From: Patrick Wendell [mailto:pwendell@gmail.com]
>> Sent: Tuesday, June 03, 2014 11:40 AM
>> To: user@spark.apache.org
>> Subject: Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing
>> file
>>
>> (A) Semantics in Spark 0.9 and earlier: Spark will ignore Hadoop's output
>> format check and overwrite files in the destination directory.
>> But it won't clobber the directory entirely. I.e., if the directory already
>> had "part1", "part2", "part3", and "part4", and you write a new job
>> outputting only two files ("part1", "part2"), then it would confusingly
>> leave the other two files intact.
>>
>> (B) Semantics in Spark 1.0 and later: Spark runs the Hadoop OutputFormat
>> check, which means the directory must not already exist or an exception
>> is thrown (example below).
>>
>> (C) Semantics proposed by Nicholas Chammas in this thread (AFAIK):
>> Spark will delete/clobber the destination directory if it exists, then
>> fully overwrite it with new data.
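>>
>> With (B), a second run against an existing directory now fails fast. The
>> failure looks roughly like this (path hypothetical; the exception comes
>> from Hadoop's output spec check):
>>
>>   rdd.saveAsTextFile("/data/out")  // directory already exists
>>   // org.apache.hadoop.mapred.FileAlreadyExistsException:
>>   //   Output directory /data/out already exists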
>>
>> I'm fine to add a flag that allows (A) for backwards-compatibility
>> reasons, but my point was I'd prefer not to have (C), even though I see
>> some cases where it would be useful.
>>
>> - Patrick
>>
>> On Mon, Jun 2, 2014 at 4:25 PM, Sean Owen <sowen@cloudera.com> wrote:
>> > Is there a third way? Unless I'm missing something, Hadoop's OutputFormat
>> > wants the target dir not to exist no matter what, so it's just a
>> > question of whether Spark deletes it for you or errors out.
>> >
>> > On Tue, Jun 3, 2014 at 12:22 AM, Patrick Wendell <pwendell@gmail.com>
>> > wrote:
>> >> We can just add back a flag to make it backwards compatible - it was
>> >> just missed during the original PR.
>> >>
>> >> As for adding a *third* set of "clobber" semantics, I'm slightly -1
>> >> on that, for the following reasons:
>> >>
>> >> 1. It's scary to have Spark recursively deleting user files; it could
>> >> easily lead to users deleting data by mistake if they don't
>> >> understand the exact semantics.
>> >> 2. It would introduce a third set of semantics here for saveAsXX...
>> >> 3. It's trivial for users to implement this with two lines of code
>> >> (if output dir exists, delete it) before calling saveAsHadoopFile.
>> >>
>> >> - Patrick
>> >>
>>
>>
>


-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegmann@velos.io W: www.velos.io
