spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kannan Rajah <>
Subject Understanding shuffle file name conflicts
Date Tue, 24 Mar 2015 17:56:26 GMT
I am working on SPARK-1529. I ran into an issue with my change, where the
same shuffle file was being reused across 2 jobs. Please note this only
happens when I use a hard coded location to use for shuffle files, say
"/tmp". It does not happen with normal code path that uses DiskBlockManager
to pick different directories for each run. So I want to understand how
DiskBlockManager guarantees that such a conflict will never happen.

Let's say the shuffle block id has a value of shuffle_0_0_0. So the data
file name is and index file name is shuffle_0_0_0.index.
If I run a spark job twice, one after another, these files get created
under different directories because of the hashing logic in
DiskBlockManager. But the hash is based off the file name, so how are we
sure that there won't be a conflict ever?


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message