spark-issues mailing list archives

From "Adrian Wang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-1499) Workers continuously produce failing executors
Date Tue, 22 Apr 2014 08:37:14 GMT

    [ https://issues.apache.org/jira/browse/SPARK-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976294#comment-13976294 ]

Adrian Wang edited comment on SPARK-1499 at 4/22/14 8:35 AM:
-------------------------------------------------------------

Have you looked into the log of the failing worker?
I think there must be many lines like:
{noformat}
ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@slave1:45324] -> [akka.tcp://sparkExecutor@slave1:59294]:
Error [Association failed with [akka.tcp://sparkExecutor@slave1:59294]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@slave1:59294]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: slave1/192.168.1.2:59294
{noformat}


> Workers continuously produce failing executors
> ----------------------------------------------
>
>                 Key: SPARK-1499
>                 URL: https://issues.apache.org/jira/browse/SPARK-1499
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy, Spark Core
>    Affects Versions: 1.0.0, 0.9.1
>            Reporter: Aaron Davidson
>
> If a node is in a bad state, such that newly started executors fail on startup or first use, the Standalone Cluster Worker will happily keep spawning new ones. A better behavior would be for a Worker to mark itself as dead if it has had a history of continuously producing erroneous executors, or else to somehow prevent a driver from re-registering executors from the same machine repeatedly.
> Reported on mailing list: http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3CCAL8t0BqJFgtf-Vbzjq6Yj7CKBL_9P9S0tRVEW2MVG6ZBNgxY2g@mail.gmail.com%3E
> Relevant logs: 
> {noformat}
> 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor updated: app-20140411190649-0008/4 is now FAILED (Command exited with code 53)
> 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140411190649-0008/4 removed: Command exited with code 53
> 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Executor 4 disconnected, so removing it
> 14/04/11 19:06:52 ERROR scheduler.TaskSchedulerImpl: Lost an executor 4 (already removed): Failed to create local directory (bad spark.local.dir?)
> 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor added: app-20140411190649-0008/27 on worker-20140409212012-ip-172-31-19-11.us-west-1.compute.internal-58614 (ip-172-31-19-11.us-west-1.compute.internal:58614) with 8 cores
> 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20140411190649-0008/27 on hostPort ip-172-31-19-11.us-west-1.compute.internal:58614 with 8 cores, 56.9 GB RAM
> 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor updated: app-20140411190649-0008/27 is now RUNNING
> 14/04/11 19:06:52 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Registering block manager ip-172-31-24-76.us-west-1.compute.internal:50256 with 32.7 GB RAM
> 14/04/11 19:06:52 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=wikistats_pd
> 14/04/11 19:06:52 INFO HiveMetaStore.audit: ugi=root	ip=unknown-ip-addr	cmd=get_table : db=default tbl=wikistats_pd
> 14/04/11 19:06:53 DEBUG hive.log: DDL: struct wikistats_pd { string projectcode, string pagename, i32 pageviews, i32 bytes}
> 14/04/11 19:06:53 DEBUG lazy.LazySimpleSerDe: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe initialized with: columnNames=[projectcode, pagename, pageviews, bytes] columnTypes=[string, string, int, int] separator=[[B@29a81175] nullstring=\N lastColumnTakesRest=false
> shark> 14/04/11 19:06:55 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@ip-172-31-19-11.us-west-1.compute.internal:45248/user/Executor#-1002203295] with ID 27
> show 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Executor 27 disconnected, so removing it
> 14/04/11 19:06:56 ERROR scheduler.TaskSchedulerImpl: Lost an executor 27 (already removed): remote Akka client disassociated
> 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor updated: app-20140411190649-0008/27 is now FAILED (Command exited with code 53)
> 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140411190649-0008/27 removed: Command exited with code 53
> 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor added: app-20140411190649-0008/28 on worker-20140409212012-ip-172-31-19-11.us-west-1.compute.internal-58614 (ip-172-31-19-11.us-west-1.compute.internal:58614) with 8 cores
> 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20140411190649-0008/28 on hostPort ip-172-31-19-11.us-west-1.compute.internal:58614 with 8 cores, 56.9 GB RAM
> 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor updated: app-20140411190649-0008/28 is now RUNNING
> tables;
> {noformat}
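The behavior proposed in the issue description above is essentially a bounded retry: track consecutive executor failures and stop launching (or mark the worker dead) once a threshold is crossed. A minimal sketch of that bookkeeping, using hypothetical names ({{ExecutorFailureTracker}}, {{maxConsecutiveFailures}}) rather than Spark's actual Worker internals:

{code:scala}
import scala.collection.mutable

// Hypothetical sketch of the guard proposed in this ticket; the names and
// the threshold are illustrative, not Spark's real deploy.worker code.
class ExecutorFailureTracker(maxConsecutiveFailures: Int = 10) {
  // Consecutive failure streak per application id.
  private val failures = mutable.Map.empty[String, Int].withDefaultValue(0)

  // An executor for appId exited with a non-zero code before doing useful work.
  def recordFailure(appId: String): Unit = failures(appId) += 1

  // An executor registered and ran successfully; reset the streak so one
  // transient failure does not permanently blacklist a healthy node.
  def recordSuccess(appId: String): Unit = failures(appId) = 0

  // The Worker would consult this before spawning another executor for the
  // same app, declining the launch (or marking itself dead) past the limit.
  def shouldLaunch(appId: String): Boolean = failures(appId) < maxConsecutiveFailures
}
{code}

Whether the streak should be scoped per application, per worker, or per (worker, application) pair is exactly the design question the ticket raises.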



