spark-user mailing list archives

From "S. Zhou" <>
Subject Re: Spark executor lost
Date Thu, 04 Dec 2014 19:00:31 GMT
Here is a sample exception I collected from a Spark worker node (there are many such errors
across the other worker nodes). It looks to me like the Spark worker failed to communicate with the executor:
14/12/04 04:26:37 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@spark-prod1.XXX:7079] -> [akka.tcp://sparkExecutor@spark-prod1.XXX:47710]: Error [Association failed with [akka.tcp://sparkExecutor@spark-prod1.XXX:47710]]
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@spark-prod1.XXX:47710]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: spark-prod1.XXX/10.51.XX.XX:47710
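
"Connection refused" generally means nothing was listening on that port when the worker tried to connect, which would be consistent with the executor process having already exited. A minimal sketch of a TCP probe that could be run from the worker host to check this, assuming Python is available there; port_is_open is a hypothetical helper name, and the host and port have to come from your own log line:

```python
import socket

def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unresolvable/unreachable hosts.
        return False
```

A False result only says the port is not reachable right now; it does not say why the executor went away.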


     On Wednesday, December 3, 2014 5:05 PM, Ted Yu <> wrote:

 bq.  to get the logs from the data nodes
Minor correction: the logs are collected from machines where node managers run.
On Wed, Dec 3, 2014 at 3:39 PM, Ganelin, Ilya <> wrote:

You want to look further up the stack (there are almost certainly other errors before this
happens); those earlier errors may give you a better idea of what is going on. Also, if you
are running on YARN, you can run "yarn logs -applicationId <yourAppId>" to get the logs
from the data nodes.
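
As a rough illustration of "looking further up", the sketch below collects the ERROR and WARN lines that appear before the first "Lost executor" line in a saved log. The function name errors_before, the default marker, and the level names are assumptions made for this example, not part of Spark or YARN:

```python
import re

def errors_before(log_lines, marker="Lost executor", levels=("ERROR", "WARN")):
    """Return ERROR/WARN lines that occur before the first line containing marker.

    Returns an empty list if the marker never appears.
    """
    level_re = re.compile(r"\b(" + "|".join(levels) + r")\b")
    earlier = []
    for line in log_lines:
        if marker in line:
            return earlier
        if level_re.search(line):
            earlier.append(line.rstrip("\n"))
    return []
```

It can be fed an open file object, for example one holding the output of the yarn logs command saved to disk.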


-----Original Message-----
From: S. Zhou []
Sent: Wednesday, December 03, 2014 06:30 PM Eastern Standard Time
Subject: Spark executor lost

We are using Spark job server to submit Spark jobs (our Spark version is 0.91). After running
the Spark job server for a while, we often see the following errors (executor lost) in the
Spark job server log. As a consequence, the Spark driver (allocated inside the Spark job server)
gradually loses executors, and eventually the Spark job server is no longer able to submit jobs.
We have searched for solutions but so far had no luck. Please help if you have any ideas. Thanks!
[2014-11-25 01:37:36,250] INFO  parkDeploySchedulerBackend [] [akka://JobServer/user/context-supervisor/next-staging] - Executor 6 disconnected, so removing it
[2014-11-25 01:37:36,252] ERROR cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/next-staging] - Lost executor 6 on XXXX: remote Akka client disassociated
[2014-11-25 01:37:36,252] INFO  ark.scheduler.DAGScheduler [] [] - Executor lost: 6 (epoch 8)
[2014-11-25 01:37:36,252] INFO  ge.BlockManagerMasterActor [] [] - Trying to remove executor 6 from BlockManagerMaster.
[2014-11-25 01:37:36,252] INFO  storage.BlockManagerMaster [] [] - Removed 6 successfully in removeExecutor
[2014-11-25 01:37:36,286] INFO  ient.AppClient$ClientActor [] [akka://JobServer/user/context-supervisor/next-staging] - Executor updated: app-20141125002023-0037/6 is now FAILED (Command exited with code 143)
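
The exit code in the last log entry can be decoded: by the usual Unix convention, a code above 128 means the process was terminated by signal (code - 128), so 143 would correspond to signal 15, SIGTERM. That suggests the executor was asked to terminate rather than crashing on its own, though the log does not say by whom. A minimal sketch for pulling these events out of a job server log, assuming the message format shown above; describe_exit and failed_executors are hypothetical helper names:

```python
import re
import signal

# Matches "Executor updated: <appId>/<executorId> is now <STATE> (Command exited with code <N>)",
# tolerating line wraps between the tokens.
EXIT_RE = re.compile(
    r"Executor updated:\s+(\S+)/(\d+)\s+is now\s+(\w+)\s+"
    r"\(Command exited with code (\d+)\)"
)

def describe_exit(code):
    """Interpret an exit code using the 128 + signal-number convention."""
    if code > 128:
        try:
            return "killed by " + signal.Signals(code - 128).name
        except ValueError:
            return "killed by signal %d" % (code - 128)
    return "exited with status %d" % code

def failed_executors(log_text):
    """Return (app_id, executor_id, state, exit_description) for each match."""
    return [
        (app, int(executor), state, describe_exit(int(code)))
        for app, executor, state, code in EXIT_RE.findall(log_text)
    ]
```

Counting how many executors report the same signal, and at what times, may help narrow down whether something external (for example an operator, a supervisor script, or the worker itself) is stopping them.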

