spark-user mailing list archives

From Aaron Davidson <>
Subject Re: Spark resilience
Date Tue, 15 Apr 2014 03:12:54 GMT
Launching drivers inside the cluster was a feature added in 0.9, for
standalone cluster mode.

Note the "supervise" flag, which will cause the driver to be restarted if
it fails. This is a rather low-level mechanism which by default will just
cause the whole job to rerun from the beginning. Special recovery would
have to be implemented by hand, via some sort of state checkpointing into a
globally visible storage system (e.g., HDFS), which, for example, Spark
Streaming already does.
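As a sketch of the mechanism described above: in the 0.9 standalone cluster mode, the driver is launched through `org.apache.spark.deploy.Client`, with `--supervise` as a client option. The cluster URL, jar path, and main class below are placeholders, and the exact argument order is as I recall it from the standalone-mode docs:

```shell
# Launch a driver inside a standalone cluster (0.9-era syntax).
# --supervise asks the Master to restart the driver if it exits abnormally;
# by default the restarted driver reruns the job from the beginning.
./bin/spark-class org.apache.spark.deploy.Client launch \
  --supervise \
  spark://master-host:7077 \
  hdfs://namenode:8020/user/me/my-app.jar \
  com.example.MyApp
```

Later releases expose the same behavior through `spark-submit --deploy-mode cluster --supervise`.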

Currently, this feature is not supported in YARN or Mesos fine-grained mode.
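For the standalone Master fault tolerance discussed below in this thread, ZooKeeper-based recovery is enabled through `spark-env.sh`. A minimal sketch, with placeholder ZooKeeper host names:

```shell
# In conf/spark-env.sh on each standalone Master (hosts are placeholders).
# With recoveryMode=ZOOKEEPER, standby Masters can take over on failure;
# running jobs keep their Executors while a new Master is elected.
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"
```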

On Mon, Apr 14, 2014 at 2:08 PM, Manoj Samel <> wrote:

> Could you please elaborate how drivers can be restarted automatically ?
> Thanks,
> On Mon, Apr 14, 2014 at 10:30 AM, Aaron Davidson <> wrote:
>> Master and slave are somewhat overloaded terms in the Spark ecosystem
>> (see the glossary). Are
>> you actually asking about the Spark "driver" and "executors", or the
>> standalone cluster "master" and "workers"?
>> To briefly answer for either possibility:
>> (1) Drivers are not fault tolerant but can be restarted automatically,
>> Executors may be removed at any point without failing the job (though
>> losing an Executor may slow the job significantly), and Executors may be
>> added at any point and will be immediately used.
>> (2) Standalone cluster Masters are fault tolerant and failure will only
>> temporarily stall new jobs from starting or getting new resources, but does
>> not affect currently-running jobs. Workers can fail and will simply cause
>> jobs to lose their current Executors. New Workers can be added at any point.
>> On Mon, Apr 14, 2014 at 11:00 AM, Ian Ferreira <> wrote:
>>> Folks,
>>> I was wondering what the failure support modes were for Spark while
>>> running jobs
>>>    1. What happens when a master fails
>>>    2. What happens when a slave fails
>>>    3. Can you mid job add and remove slaves
>>> Regarding the install on Mesos, if I understand correctly the Spark
>>> master is behind a ZooKeeper quorum, which isolates the slaves from a
>>> master failure, but what about the masters behind the quorum?
>>> Cheers
>>> - Ian
