spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matt Cheah <mch...@palantir.com.INVALID>
Subject [SPARK-25299] A Discussion About Shuffle Metadata Tracking
Date Fri, 13 Mar 2020 23:21:37 GMT
Hi everyone,

A working group in the community have been having ongoing discussions regarding how we can
allow for flexible storage solutions for shuffle data that is compatible with containerized
systems, more resilient to node failures, and can support disaggregated storage architectures.

One of the core challenges we have been trying to overcome is navigating the space of shuffle
metadata tracking, and reasoning about how we approach recomputing lost shuffle blocks in
the case when the shuffle file storage system is not resilient.

I have written a design document on the subject, and a proposed set of APIs to fix it. These
should be considered as part of the APIs for SPARK-25299<https://issues.apache.org/jira/browse/SPARK-25299>.
Once we have reached some common conclusion on the proper APIs to build, I can modify the
original SPARK-25299 SPIP to reflect the choices we’ve made. But I wanted to write more
extensively on this topic separately to encourage focused discussion on this subset of the
problem space.

You can find the design document here: https://docs.google.com/document/d/1Aj6IyMsbS2sdIfHxLvIbHUNjHIWHTabfknIPoxOrTjk/edit?usp=sharing

If you would like to catch up on the discussions we have had so far, I give some background
to the subject matter and have linked to other relevant design documents and discussion threads
in this document.

Feedback is definitely appreciated – I acknowledge that this is a fairly complex space with
lots of potential viable options, so I’m looking forward to engaging with dialogue moving
forward.

Thanks!

-Matt Cheah
Mime
View raw message