spark-dev mailing list archives

From Saisai Shao <sai.sai.s...@gmail.com>
Subject Re: Understanding shuffle file name conflicts
Date Wed, 25 Mar 2015 02:35:23 GMT
Hi Kannan,

As far as I know, the shuffle ID in ShuffleDependency is incremented for
every new shuffle dependency. So even if you run the same job twice, the
shuffle dependency and its shuffle ID are different each time. The shuffle
file name is built from (shuffleId, mapId, reduceId), so it changes too, and
there should be no name conflict even within the same directory.
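
To make the naming argument concrete, here is a rough Python sketch (not
Spark code). The counter and the helper names new_shuffle_id and
shuffle_file_names are hypothetical; only the shuffle_<shuffleId>_<mapId>_<reduceId>
pattern with .data and .index suffixes comes from this thread.

```python
import itertools

# Hypothetical stand-in for a monotonically increasing shuffle ID counter.
_next_shuffle_id = itertools.count()


def new_shuffle_id() -> int:
    """Return the next unused shuffle ID (0, 1, 2, ...)."""
    return next(_next_shuffle_id)


def shuffle_file_names(shuffle_id: int, map_id: int, reduce_id: int):
    """Build the data and index file names for one shuffle block."""
    base = f"shuffle_{shuffle_id}_{map_id}_{reduce_id}"
    return base + ".data", base + ".index"


# Running "the same job" twice takes two different shuffle IDs,
# so the file names differ even for the same mapId and reduceId.
first_run = shuffle_file_names(new_shuffle_id(), 0, 0)
second_run = shuffle_file_names(new_shuffle_id(), 0, 0)
```

Note that in this sketch the counter only lives as long as the process. If
each run started with a fresh counter, both runs would begin at 0 again and
the names would match.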

Thanks
Jerry


2015-03-25 1:56 GMT+08:00 Kannan Rajah <krajah@maprtech.com>:

> I am working on SPARK-1529. I ran into an issue with my change, where the
> same shuffle file was being reused across 2 jobs. Please note this only
> happens when I use a hard-coded location for shuffle files, say "/tmp".
> It does not happen with the normal code path, which uses DiskBlockManager
> to pick different directories for each run. So I want to understand how
> DiskBlockManager guarantees that such a conflict will never happen.
>
> Let's say the shuffle block id has a value of shuffle_0_0_0. So the data
> file name is shuffle_0_0_0.data and index file name is shuffle_0_0_0.index.
> If I run a Spark job twice, one after another, these files get created
> under different directories because of the hashing logic in
> DiskBlockManager. But the hash is based on the file name, so how can we
> be sure that there will never be a conflict?
>
> --
> Kannan
>
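
For reference, the hashing described in the quoted message can be sketched
roughly as below. This is a Python approximation under stated assumptions,
not the actual DiskBlockManager code: it assumes the file name is hashed
with a Java-style string hash, and that the hash selects first a root
directory and then a two-digit hex subdirectory. The function names and the
example root paths are hypothetical.

```python
import os


def java_string_hash(s: str) -> int:
    """Java-style String.hashCode for ASCII/BMP text, as a signed 32-bit int."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def non_negative_hash(s: str) -> int:
    """Map the hash to a non-negative value (the minimum int maps to 0)."""
    h = java_string_hash(s)
    return 0 if h == -0x80000000 else abs(h)


def pick_path(local_dirs, sub_dirs_per_local_dir, filename):
    """Choose a root directory and subdirectory purely from the file name."""
    h = non_negative_hash(filename)
    dir_id = h % len(local_dirs)
    sub_dir_id = (h // len(local_dirs)) % sub_dirs_per_local_dir
    return os.path.join(local_dirs[dir_id], f"{sub_dir_id:02x}", filename)


name = "shuffle_0_0_0.data"
# Same roots and same name always give the same path ...
same_a = pick_path(["/tmp"], 64, name)
same_b = pick_path(["/tmp"], 64, name)
# ... so in this sketch only distinct roots keep two runs apart.
run_a = pick_path(["/tmp/run-a"], 64, name)
run_b = pick_path(["/tmp/run-b"], 64, name)
```

Under these assumptions the hash only spreads files across subdirectories.
It does not by itself separate two files that share a name, which is why a
fixed root such as "/tmp" would depend on the names being unique.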
