spark-user mailing list archives

From Akhil Das <ak...@sigmoidanalytics.com>
Subject Re: Spark executor lost
Date Fri, 05 Dec 2014 07:05:35 GMT
It says connection refused, so first make sure the network is configured
properly (the ports between the master and the worker nodes must be open).
If the ports are configured correctly, then I assume the executor process is
getting killed for some reason, hence the refused connection.
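
Both possibilities can be checked with a short script. The Python sketch
below is only an illustration and is not part of Spark: can_connect and
describe_exit_code are hypothetical helper names. The first tries a plain TCP
connection to a host and port. The second reads an exit code under the common
convention that a status above 128 means 128 plus a signal number. Under that
assumption, the "Command exited with code 143" in the quoted job server log
would correspond to SIGTERM (128 + 15), which fits the idea that the executor
is being killed rather than blocked by the network.

```python
import signal
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


def describe_exit_code(code: int) -> str:
    """Interpret an exit code, assuming the 128 + signal number convention."""
    if code > 128:
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # Not a signal number known on this platform.
            return f"killed by signal {signum}"
    return f"exited with status {code}"


if __name__ == "__main__":
    # Placeholder host and port; substitute the address from your own log.
    print(can_connect("worker-host.example", 47710))
    print(describe_exit_code(143))
```

Run the connection check from the worker node toward the executor address
reported in the AssociationError. If it fails while the port is known to be
open, the executor process has most likely already exited.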

Thanks
Best Regards

On Fri, Dec 5, 2014 at 12:30 AM, S. Zhou <myxjtu@yahoo.com.invalid> wrote:

> Here is a sample exception I collected from a Spark worker node (there
> are many such errors across the other worker nodes). It looks to me like
> the Spark worker failed to communicate with the executor locally.
>
> 14/12/04 04:26:37 ERROR EndpointWriter: AssociationError
> [akka.tcp://sparkWorker@spark-prod1.XXX:7079] ->
> [akka.tcp://sparkExecutor@spark-prod1.XXX:47710]: Error [Association
> failed with [akka.tcp://sparkExecutor@spark-prod1.XXX:47710]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://sparkExecutor@spark-prod1.XXX:47710]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: spark-prod1.XXX/10.51.XX.XX:47710
>
>
>
>   On Wednesday, December 3, 2014 5:05 PM, Ted Yu <yuzhihong@gmail.com>
> wrote:
>
>
> bq.  to get the logs from the data nodes
>
> Minor correction: the logs are collected from machines where node managers
> run.
>
> Cheers
>
> On Wed, Dec 3, 2014 at 3:39 PM, Ganelin, Ilya
> <Ilya.Ganelin@capitalone.com> wrote:
>
>  You want to look further up the stack (there are almost certainly other
> errors before this happens), and those other errors may give you a better
> idea of what is going on. Also, if you are running on YARN you can run "yarn
> logs -applicationId <yourAppId>" to get the logs from the data nodes.
>
> -----Original Message-----
> *From: *S. Zhou [myxjtu@yahoo.com.INVALID]
> *Sent: *Wednesday, December 03, 2014 06:30 PM Eastern Standard Time
> *To: *user@spark.apache.org
> *Subject: *Spark executor lost
>
>  We are using Spark job server to submit Spark jobs (our Spark version is
> 0.91). After running the Spark job server for a while, we often see the
> following errors (executor lost) in the Spark job server log. As a
> consequence, the Spark driver (allocated inside the Spark job server)
> gradually loses executors, and eventually the Spark job server is no longer
> able to submit jobs. We tried to Google for solutions but have had no luck
> so far. Please help if you have any ideas. Thanks!
>
> [2014-11-25 01:37:36,250] INFO  parkDeploySchedulerBackend []
> [akka://JobServer/user/context-supervisor/next-staging] - Executor 6
> disconnected, so removing it
> [2014-11-25 01:37:36,252] ERROR cheduler.TaskSchedulerImpl []
> [akka://JobServer/user/context-supervisor/next-staging] - Lost executor 6
> on XXXX: remote Akka client disassociated
> [2014-11-25 01:37:36,252] INFO  ark.scheduler.DAGScheduler [] [] - *Executor
> lost*: 6 (epoch 8)
> [2014-11-25 01:37:36,252] INFO  ge.BlockManagerMasterActor [] [] - Trying
> to remove executor 6 from BlockManagerMaster.
> [2014-11-25 01:37:36,252] INFO  storage.BlockManagerMaster [] [] - Removed
> 6 successfully in removeExecutor
> [2014-11-25 01:37:36,286] INFO  ient.AppClient$ClientActor []
> [akka://JobServer/user/context-supervisor/next-staging] - Executor updated:
> app-20141125002023-0037/6 is now FAILED (Command exited with code 143)
>
