spark-issues mailing list archives

From "Nikita Gorbachevski (Jira)" <j...@apache.org>
Subject [jira] [Created] (SPARK-29475) Executors aren't marked as dead on OutOfMemoryError in YARN mode
Date Tue, 15 Oct 2019 11:05:00 GMT
Nikita Gorbachevski created SPARK-29475:
-------------------------------------------

             Summary: Executors aren't marked as dead on OutOfMemoryError in YARN mode
                 Key: SPARK-29475
                 URL: https://issues.apache.org/jira/browse/SPARK-29475
             Project: Spark
          Issue Type: Bug
          Components: DStreams, Spark Core, YARN
    Affects Versions: 2.3.0
            Reporter: Nikita Gorbachevski


My Spark Streaming application runs in YARN cluster mode with ``spark.streaming.concurrentJobs`` set to 50. At some point I observed that lots of batches were scheduled but the application did not make any progress.
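For context, this is roughly how that setting is applied (a minimal sketch; the app name and batch interval are illustrative, not copied from my job):

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only. spark.streaming.concurrentJobs lets several batch jobs run in
// parallel instead of strictly one at a time; the master is supplied by
// spark-submit in yarn cluster mode.
val conf = new SparkConf()
  .setAppName("streaming-normalizer")              // hypothetical app name
  .set("spark.streaming.concurrentJobs", "50")

val ssc = new StreamingContext(conf, Seconds(5))   // batch interval is an assumption
{code}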

A thread dump showed that all the streaming threads were blocked, waiting indefinitely for a result from an executor on ``ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)``:

{code:java}
"streaming-job-executor-11" - Thread t@324
   java.lang.Thread.State: WAITING
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <7fd76f11> (a scala.concurrent.impl.Promise$CompletionLatch)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
        at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:222)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:633)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099){code}
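The promise these threads wait on is the job's completion future, and the timeout is infinite. A minimal, self-contained illustration of that failure mode (plain Scala futures, not Spark internals; the program below never terminates, which matches the observed behaviour):

{code:scala}
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration.Duration

object AwaitForeverDemo {
  def main(args: Array[String]): Unit = {
    // Stand-in for JobWaiter.completionFuture: a promise nobody ever completes,
    // e.g. because the executor died and the driver was never told about it.
    val completion = Promise[Unit]()

    // Stand-in for ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf):
    // with an infinite timeout the calling thread parks here forever.
    Await.ready(completion.future, Duration.Inf)
  }
}
{code}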
 

My tasks are short-running and pretty simple, e.g. read raw data from Kafka, normalize it, and write it back to Kafka.
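For reference, a minimal sketch of what such a job looks like (topic names, brokers and the normalize function are placeholders; this assumes the spark-streaming-kafka-0-10 connector and a StreamingContext like the one sketched above):

{code:scala}
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object NormalizePipeline {
  // Hypothetical normalization; the real logic doesn't matter for this issue.
  def normalize(raw: String): String = raw.trim.toLowerCase

  def build(ssc: StreamingContext): Unit = {
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka:9092",            // placeholder brokers
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "normalize-job")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("raw-events"), kafkaParams))

    stream.map(r => normalize(r.value())).foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // One short-lived producer per partition; a real job would pool/reuse it.
        val props = new Properties()
        props.put("bootstrap.servers", "kafka:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        partition.foreach(v => producer.send(new ProducerRecord[String, String]("normalized-events", v)))
        producer.close()
      }
    }
  }
}
{code}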

After analyzing the logs I noticed that there were lots of ``RpcTimeoutException`` errors, on both the executor side:

{code:java}
driver-heartbeater WARN executor.Executor: Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
	at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
{code}
and the driver side, during context clearing:

{code:java}
WARN storage.BlockManagerMaster: Failed to remove RDD 583574 - Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from bi-prod-hadoop-17.va2:25109 in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
	at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
{code}
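The timeouts named in these messages are ordinary Spark settings. For reference, this is roughly where they live (the values below are illustrative; raising them only papers over the problem rather than fixing the missing executor-lost notification):

{code:scala}
import org.apache.spark.SparkConf

// Illustrative values only; these are the settings named in the warnings above.
val conf = new SparkConf()
  .set("spark.executor.heartbeatInterval", "10s") // also used as the timeout of the heartbeat RPC
  .set("spark.rpc.askTimeout", "120s")            // driver-side ask timeout (e.g. when removing RDDs)
  .set("spark.network.timeout", "120s")           // default for several network/RPC timeouts
{code}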
Also, the only error on the executors was:

{code:java}
SIGTERM handler ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
{code}
There were no exceptions at all.

After digging into the YARN logs I noticed that the executors were killed by the AM because of high memory usage.
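For completeness, the usual knobs for the headroom YARN enforces are below (values are illustrative; this addresses the kills themselves, not the missing executor-lost notification this ticket is about):

{code:scala}
import org.apache.spark.SparkConf

// Illustrative values only. YARN kills a container when the process exceeds
// executor memory plus the configured overhead.
val conf = new SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.executor.memoryOverhead", "1g")   // spark.yarn.executor.memoryOverhead before 2.3
{code}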

Also, there were no logs on the driver side saying that executors were lost, so it seems the driver wasn't notified about this.

Unfortunately I can't find the line of code in CoarseGrainedExecutorBackend which logs such a message. The ``exitExecutor`` method looks similar, but its message should look different.

However, I believe the driver is notified that an executor is lost via an async RPC call, but if the executor encounters RPC issues because of high GC pressure this message won't be sent.
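To make the suspicion concrete, here is a rough sketch of the pattern I mean. The names loosely mirror ``exitExecutor`` and the executor-removal message, but this is my approximation, not the actual Spark source:

{code:scala}
object LostNotificationSketch {
  // Stand-in for an async, fire-and-forget RPC endpoint reference.
  trait AsyncEndpoint { def send(msg: Any): Unit }

  // Stand-in for the executor-lost message the driver would need to receive.
  case class RemoveExecutor(executorId: String, reason: String)

  // Rough shape of what I believe happens on executor shutdown: notify the
  // driver asynchronously, then exit. If the RPC layer is already failing
  // (GC pauses, the timeouts shown above), the send is lost silently and the
  // driver keeps waiting for tasks on an executor that no longer exists.
  def exitExecutor(driver: Option[AsyncEndpoint], executorId: String, reason: String): Unit = {
    driver.foreach(_.send(RemoveExecutor(executorId, reason)))  // async, no ack, no retry
    sys.exit(1)
  }
}
{code}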

 


