spark-dev mailing list archives

From Saisai Shao <sai.sai.s...@gmail.com>
Subject Re: Understanding shuffle file name conflicts
Date Wed, 25 Mar 2015 06:56:40 GMT
Yes, as Josh said, when an application starts, Spark creates a unique
application-wide folder for its temporary files. Each job in the
application gets a unique shuffle id, which is part of the shuffle file
names, so shuffle stages within an app will not run into name conflicts.
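The within-application uniqueness can be illustrated with a small sketch. The helper names here are hypothetical; the real name format comes from Spark's ShuffleBlockId, and the counter mimics how new shuffle ids are handed out per shuffle registration:

```python
# Sketch (not Spark's actual code): shuffle file names stay unique within
# one application because every new shuffle gets a fresh, incremented id,
# and the id is embedded in the file name.
import itertools

shuffle_id_counter = itertools.count()  # application-wide, monotonically increasing

def new_shuffle_id():
    # Mimics the per-application shuffle id counter.
    return next(shuffle_id_counter)

def shuffle_data_file(shuffle_id, map_id, reduce_id):
    # Same "shuffle_<shuffleId>_<mapId>_<reduceId>.data" shape as Spark's
    # shuffle data files.
    return f"shuffle_{shuffle_id}_{map_id}_{reduce_id}.data"

# Two identical jobs in the same application still produce distinct names,
# because each registers a fresh shuffle id.
job1 = shuffle_data_file(new_shuffle_id(), 0, 0)  # shuffle_0_0_0.data
job2 = shuffle_data_file(new_shuffle_id(), 0, 0)  # shuffle_1_0_0.data
assert job1 != job2
```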

Shuffle files from different applications are also separated by the
application folder, so name conflicts cannot happen across applications
either.
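The cross-application separation can be sketched as well. This is loosely modeled on Utils.createDirectory (linked in Josh's reply below), not a copy of it: a random-UUID subdirectory is created under each local root, so two applications never share a directory even when the file names inside are identical.

```python
# Sketch of per-application directory separation: keep trying random UUID
# names until a fresh directory is created, so each application gets its
# own subdirectory under the local root.
import os
import tempfile
import uuid

def create_app_dir(root, prefix="spark"):
    # Retry with a new UUID if the candidate already exists (mirrors the
    # existence check mentioned in the thread).
    while True:
        candidate = os.path.join(root, f"{prefix}-{uuid.uuid4()}")
        if not os.path.exists(candidate):
            os.makedirs(candidate)
            return candidate

root = tempfile.mkdtemp()
app1_dir = create_app_dir(root)  # e.g. .../spark-<uuid-1>
app2_dir = create_app_dir(root)  # e.g. .../spark-<uuid-2>
assert app1_dir != app2_dir

# The same relative file name resolves to different absolute paths:
path1 = os.path.join(app1_dir, "shuffle_0_0_0.data")
path2 = os.path.join(app2_dir, "shuffle_0_0_0.data")
assert path1 != path2
```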

Maybe you changed some parts of the code while working on the patch.

Thanks
Jerry


2015-03-25 14:22 GMT+08:00 Josh Rosen <rosenville@gmail.com>:

> Which version of Spark are you using?  What do you mean when you say that
> you used a hardcoded location for shuffle files?
>
> If you look at the current DiskBlockManager code, it looks like it will
> create a per-application subdirectory in each of the local root directories.
>
> Here's the call to create a subdirectory in each root dir:
> https://github.com/apache/spark/blob/c5cc41468e8709d09c09289bb55bc8edc99404b1/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L126
>
> This call to Utils.createDirectory() should result in a fresh subdirectory
> being created for just this application (note the use of random UUIDs, plus
> the check to ensure that the directory doesn't already exist):
>
> https://github.com/apache/spark/blob/c5cc41468e8709d09c09289bb55bc8edc99404b1/core/src/main/scala/org/apache/spark/util/Utils.scala#L273
>
> So, although the filenames for shuffle files are not globally unique,
> their full paths should be unique due to these unique per-application
> subdirectories.  Have you observed an instance where this isn't the case?
>
> - Josh
>
> On Tue, Mar 24, 2015 at 11:04 PM, Kannan Rajah <krajah@maprtech.com>
> wrote:
>
>> Saisai,
>> This is not the case when I use spark-submit to run 2 jobs, one after
>> another. The shuffle id remains the same.
>>
>>
>> --
>> Kannan
>>
>> On Tue, Mar 24, 2015 at 7:35 PM, Saisai Shao <sai.sai.shao@gmail.com>
>> wrote:
>>
>> > Hi Kannan,
>> >
>> > As far as I know, the shuffle id in ShuffleDependency is incremented, so
>> > even if you run the same job twice, the shuffle dependency as well as
>> > the shuffle id is different. The shuffle file name, which is composed of
>> > (shuffleId + mapId + reduceId), will therefore change, so there is no
>> > name conflict even within the same directory, as far as I know.
>> >
>> > Thanks
>> > Jerry
>> >
>> >
>> > 2015-03-25 1:56 GMT+08:00 Kannan Rajah <krajah@maprtech.com>:
>> >
>> >> I am working on SPARK-1529. I ran into an issue with my change, where
>> the
>> >> same shuffle file was being reused across 2 jobs. Please note this only
>> >> happens when I use a hard-coded location for shuffle files, say "/tmp".
>> >> It does not happen with the normal code path, which uses
>> >> DiskBlockManager to pick different directories for each run. So I want
>> >> to understand how
>> >> DiskBlockManager guarantees that such a conflict will never happen.
>> >>
>> >> Let's say the shuffle block id has a value of shuffle_0_0_0. So the
>> data
>> >> file name is shuffle_0_0_0.data and index file name is
>> >> shuffle_0_0_0.index.
>> >> If I run a spark job twice, one after another, these files get created
>> >> under different directories because of the hashing logic in
>> >> DiskBlockManager. But the hash is based on the file name, so how can we
>> >> be sure that there won't ever be a conflict?
>> >>
>> >> --
>> >> Kannan
>> >>
>> >
>> >
>>
>
>
