spark-issues mailing list archives

From "t oo (Jira)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-24617) Spark driver not requesting another executor once original executor exits due to 'lost worker'
Date Sun, 01 Mar 2020 16:49:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-24617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048627#comment-17048627 ]

t oo commented on SPARK-24617:
------------------------------

Same problem in Spark 2.3.4.

> Spark driver not requesting another executor once original executor exits due to 'lost worker'
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24617
>                 URL: https://issues.apache.org/jira/browse/SPARK-24617
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.1.1
>            Reporter: t oo
>            Priority: Major
>              Labels: bulk-closed
>
> I am running Spark v2.1.1 in standalone mode (no YARN/Mesos) across EC2 instances. I have one
> master EC2 instance that also acts as the driver (spark-submit is called on this host); spark.master
> is set and the deploy mode is client (so spark-submit only returns a return code to the PuTTY
> window once it finishes processing). I have one worker EC2 instance registered with the Spark master.
> When I run spark-submit on the master, I can see in the web UI that executors start on the
> worker, and I can verify successful completion. However, if the worker EC2 instance is terminated
> while spark-submit is running, and a new EC2 worker comes up 3 minutes later and registers
> with the master, the web UI shows 'cannot find address' in the executor status, but the driver
> keeps waiting forever (I killed it 2 days later), or in some cases the driver allocates tasks
> to the new worker only 5 hours later and then completes. Is there some setting I am missing
> that would explain this behavior?
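
For anyone trying to reproduce the setup described above, the submission can be sketched roughly as follows. This is only an illustration under stated assumptions: the master URL, the application file name, and the helper build_submit_command are hypothetical and do not come from the report. The spark.network.timeout entry is shown at its documented default purely to illustrate how a setting is passed; it is not a known fix for this issue.

```python
import shlex


def build_submit_command(master_url, app_path, extra_conf=None):
    """Build a spark-submit argument list for standalone, client deploy mode.

    Hypothetical helper for illustration only; it is not part of Spark.
    """
    cmd = ["spark-submit", "--master", master_url, "--deploy-mode", "client"]
    # Each extra setting is passed as a separate --conf key=value pair.
    for key, value in sorted((extra_conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_path)
    return cmd


# Assumed values: the host name and job file are placeholders.
cmd = build_submit_command(
    "spark://master-host:7077",
    "my_job.py",
    {"spark.network.timeout": "120s"},  # documented default, illustration only
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

In client deploy mode the driver runs inside the spark-submit process itself, which is consistent with the report that the command only returns once processing finishes.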



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

