spark-issues mailing list archives

From "Marcelo Vanzin (JIRA)" <>
Subject [jira] [Resolved] (SPARK-12516) Properly handle NM failure situation for Spark on Yarn
Date Tue, 12 Feb 2019 20:44:00 GMT


Marcelo Vanzin resolved SPARK-12516.
    Resolution: Cannot Reproduce

Seems like work-preserving restart fixes the issue; if there's still an issue here, feel free
to reopen.

> Properly handle NM failure situation for Spark on Yarn
> ------------------------------------------------------
>                 Key: SPARK-12516
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.6.0
>            Reporter: Saisai Shao
>            Priority: Major
> Failure of a NodeManager will make all the executors belonging to that NM exit silently.
> In the current implementation of YarnSchedulerBackend, the driver receives an onDisconnected
> event when an executor is lost and then asks the AM for the loss reason. The AM holds this
> query connection open until the RM reports back the status of the lost container, then replies
> to the driver. In the case of NM failure, the RM cannot detect the failure until a timeout
> (10 minutes by default), so the driver's loss-reason query itself times out (120 seconds).
> After that timeout, the executor state on the driver side is cleaned up, but on the AM side
> the state is still maintained until the NM heartbeat timeout. This can introduce some
> unexpected behaviors:
> ---
> * When dynamic allocation is disabled, the executor count on the driver side is lower than
> the count on the AM side after the query timeout (from 120 seconds up to 10 minutes), and
> cannot be ramped back up to the expected number until the RM detects the NM failure and
> marks the related containers as completed.
> {quote}
> For example, the target executor number is 10, with 5 NMs (each NM running 2 executors).
> When 1 NM fails, its 2 executors are lost. After the driver-side query timeout, the executor
> count on the driver side is 8, but on the AM side it is still 10, so the AM will not request
> additional containers until its count drops to 8 (after 10 minutes).
> {quote}
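The mismatch described above can be sketched with a toy model (the names and structure here are illustrative only, not Spark's actual bookkeeping; the key assumption is that the AM requests only the gap between its target and the executors it still believes are alive):

```python
def containers_requested(target, am_known_executors):
    # The AM asks the RM only for the gap between its target and
    # the executors it still believes are running.
    return max(0, target - am_known_executors)

target = 10        # desired executor count
driver_known = 10  # driver's view of live executors
am_known = 10      # AM's view of live executors

# One NM dies, taking 2 executors with it. After the driver's
# 120-second loss-reason query times out, the driver cleans up
# its state, but the AM's view is unchanged.
driver_known -= 2
print(containers_requested(target, am_known))  # 0: no ramp-up happens

# Only after the RM's NM-liveness timeout (10 minutes by default)
# are the containers marked completed and the AM's view corrected.
am_known -= 2
print(containers_requested(target, am_known))  # 2: ramp-up finally starts
```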
> ---
> * When dynamic allocation is enabled, the target executor number is maintained on both the
> driver and AM sides and synced between them. The target number on the driver side is correct
> after the query timeout (120 seconds), but the number on the AM side is incorrect until the
> NM failure is detected (10 minutes). In that window the actual executor number is less than
> the calculated one.
> {quote}
> For example, if the current target executor number on the driver side is N and on the AM
> side is M, then M - N is the number of lost executors.
> When the executor number needs to ramp up to A, the actual number will be A - (M - N).
> When the executor number needs to be brought down to B, the actual number will be
> max(0, B - (M - N)). When the actual number of executors is 0, the whole system hangs, and
> only recovers if the driver requests more resources, or after the 10-minute timeout.
> This can be reproduced by running the SparkPi example in yarn-client mode with the following
> configuration:
> spark.dynamicAllocation.enabled    true
> spark.shuffle.service.enabled      true
> spark.dynamicAllocation.minExecutors 1
> spark.dynamicAllocation.initialExecutors 2
> spark.dynamicAllocation.maxExecutors 3
> In the middle of the job, kill one NM that is running only executors (not the AM).
> {quote}
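The arithmetic in the quote above can be sketched directly (a toy illustration, not Spark code; M is the AM's stale target, N the driver's corrected target, so M - N executors are lost but unaccounted for). With the repro settings above, killing an NM that hosted 2 executors gives M - N = 2, and ramping down to B = 1 yields max(0, 1 - 2) = 0 actual executors, i.e. the hang:

```python
def actual_on_ramp_up(A, M, N):
    # Ramp up to A: the AM over-counts by (M - N) stale executors,
    # so only A - (M - N) are actually obtained.
    return A - (M - N)

def actual_on_ramp_down(B, M, N):
    # Ramp down to B: the same stale count can drive the real
    # number of executors all the way to zero.
    return max(0, B - (M - N))

# With the repro configuration: maxExecutors 3, minExecutors 1,
# and one killed NM that hosted 2 executors (M = 3, N = 1).
print(actual_on_ramp_up(3, 3, 1))    # 1 instead of 3
print(actual_on_ramp_down(1, 3, 1))  # 0: the job hangs
```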
> ---
> Possible solutions:
> * Sync the actual executor number from the driver to the AM after the RPC timeout (120
> seconds), and also clean up the related state in the AM.

This message was sent by Atlassian JIRA
