spark-dev mailing list archives

From Matt Cheah <mch...@palantir.com>
Subject Understanding fault tolerance in shuffle operations
Date Thu, 10 Mar 2016 23:26:28 GMT
Hi everyone,

I have a question about the shuffle mechanisms in Spark and the fault-tolerance I should expect.
Suppose I have a simple job with two stages  – something like rdd.textFile().mapToPair().reduceByKey().saveAsTextFile().

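For concreteness, here is roughly the shape of the job I have in mind (a minimal sketch using the Java API; the class name and input/output paths are just placeholders):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class ShuffleFaultToleranceExample {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("shuffle-fault-tolerance-example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Stage 1: read the input and emit (line, 1) pairs; at the stage
        // boundary each executor writes its map output to local shuffle files.
        // Stage 2: reduceByKey fetches those shuffle files and saveAsTextFile
        // writes out the final result.
        sc.textFile("hdfs:///tmp/input")                 // placeholder path
          .mapToPair(line -> new Tuple2<>(line, 1))
          .reduceByKey((a, b) -> a + b)
          .saveAsTextFile("hdfs:///tmp/output");         // placeholder path

        sc.stop();
      }
    }
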
The questions I have are:

  1.  Suppose I’m not using the external shuffle service. I’m running the job, and the first
stage succeeds. During the second stage, one of the executors is lost (in the simplest case,
someone runs kill -9 on it; otherwise the job would have no problem completing). Should I
expect the job to recover and complete successfully? My understanding is that the lost shuffle
files from that executor can be re-computed, so the job should still complete successfully.
  2.  Suppose I’m using the external shuffle service (enabled roughly as sketched after this
list). How does this change the answer to question #1?
  3.  Suppose I’m using the shuffle service, and I’m running in standalone mode. The first
stage succeeds. During the second stage, I kill both the executor and the worker that spawned
that executor. Now that the shuffle files associated with that worker’s shuffle service
daemon have been lost, will the job be able to recompute the lost shuffle data? This is the
scenario I run into most often: my tasks fail because they try to reach the shuffle service
instead of recomputing the lost shuffle files.
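
For reference, when I say “using the shuffle service” I mean roughly the following application-side
configuration (a sketch only; the app name is a placeholder, and spark.shuffle.service.enabled also
has to be set on the standalone Workers so they run the shuffle service daemon):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ShuffleServiceConfigExample {
      public static void main(String[] args) {
        // Ask executors to hand shuffle blocks over to the external shuffle
        // service rather than serving them directly from the executor process.
        SparkConf conf = new SparkConf()
            .setAppName("shuffle-service-example")        // placeholder name
            .set("spark.shuffle.service.enabled", "true");

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... same two-stage job body as in the sketch above ...
        sc.stop();
      }
    }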

Thanks,

-Matt Cheah