spark-user mailing list archives

From Makoto Yui <yuin...@gmail.com>
Subject akka disassociated on GC
Date Wed, 16 Jul 2014 05:48:31 GMT
Hello,

(2014/06/19 23:43), Xiangrui Meng wrote:
>> The execution was slow for the larger KDD Cup 2012, Track 2 dataset
(235M+ records with 16.7M+ (2^24) sparse features, about 33.6GB) due to
the sequential aggregation of dense vectors on a single driver node.
>>
>> It took about 7.6m for the aggregation in each iteration.
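For context, the slowdown comes from the driver folding every partition's dense vector one at a time; a tree-shaped reduction moves most of the combining onto executors, so the driver merges only the last few vectors. A minimal Python sketch of the idea, with plain lists standing in for per-partition results (no Spark involved; names are illustrative):

```python
# Sketch: sequential vs. tree-shaped reduction of per-partition
# dense vectors. In a real Spark job the inner combine steps of the
# tree would run on executors, so the driver handles O(fan-in)
# merges instead of one merge per partition.

def add_vec(a, b):
    return [x + y for x, y in zip(a, b)]

def sequential_aggregate(parts):
    # Driver folds every partition result one by one.
    acc = parts[0]
    for p in parts[1:]:
        acc = add_vec(acc, p)
    return acc

def tree_aggregate(parts):
    # Merge pairwise, level by level; depth is log2(#partitions).
    while len(parts) > 1:
        parts = [add_vec(parts[i], parts[i + 1]) if i + 1 < len(parts) else parts[i]
                 for i in range(0, len(parts), 2)]
    return parts[0]

parts = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
assert sequential_aggregate(parts) == tree_aggregate(parts) == [9.0, 12.0]
```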

When running the above test, I got another error at the beginning of the
2nd iteration when running multiple iterations.

It works fine for the first iteration, but the 2nd iteration always fails.

It seems that akka connections are suddenly disassociated when GC 
happens on the driver node. Two possible causes come to mind:
1) The driver is under heavy load because of GC, so executors cannot 
connect to the driver. (Changing the akka timeout settings did not 
resolve the issue.)
2) akka oddly drops valid connections during GC.

I'm using Spark 1.0.1, and the following akka timeout settings did not 
resolve the problem.

[spark-defaults.conf]
spark.akka.frameSize     50
spark.akka.timeout       120
spark.akka.askTimeout    120
spark.akka.lookupTimeout 120
spark.akka.heartbeat.pauses     600
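For reference, one direction that might help, given that the pauses are GC-driven, is to lengthen akka's failure-detector tolerance and shorten the driver's stop-the-world pauses. The values below are illustrative only, not tested here:

```
[spark-defaults.conf]
# Let akka's failure detector ride out long driver GC pauses
spark.akka.heartbeat.pauses     6000
spark.akka.heartbeat.interval   1000

# Shorten stop-the-world pauses on the driver with a low-pause
# collector, and log GC activity to correlate with disassociations
spark.driver.extraJavaOptions   -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCTimeStamps
```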

It seems this issue is related to one previously discussed in
http://markmail.org/message/p2i34frtf4iusdfn

Are there any preferred configurations or workarounds for this issue?

Thanks,
Makoto

--------------------------------------------
[The error log of the driver]

14/07/14 18:11:32 INFO scheduler.TaskSetManager: Serialized task 4.0:117 
as 25300254 bytes in 35 ms
666.108: [GC [PSYoungGen: 6540914K->975362K(7046784K)] 
12419091K->7792529K(23824000K), 5.2157830 secs] [Times: user=0.00 
sys=68.43, real=5.22 secs]
14/07/14 18:11:38 INFO network.ConnectionManager: Removing 
SendingConnection to ConnectionManagerId(dc09.mydomain.org,34565)
14/07/14 18:11:38 INFO network.ConnectionManager: Removing 
ReceivingConnection to ConnectionManagerId(dc09.mydomain.org,34565)
14/07/14 18:11:38 INFO client.AppClient$ClientActor: Executor updated: 
app-20140714180032-0010/8 is now EXITED (Command exited with code 1)
14/07/14 18:11:38 ERROR network.ConnectionManager: Corresponding 
SendingConnectionManagerId not found
14/07/14 18:11:38 INFO cluster.SparkDeploySchedulerBackend: Executor 
app-20140714180032-0010/8 removed: Command exited with code 1
14/07/14 18:11:38 INFO network.ConnectionManager: Removing 
SendingConnection to ConnectionManagerId(dc30.mydomain.org,59016)
14/07/14 18:11:38 INFO network.ConnectionManager: Removing 
ReceivingConnection to ConnectionManagerId(dc30.mydomain.org,59016)
14/07/14 18:11:38 ERROR network.ConnectionManager: Corresponding 
SendingConnectionManagerId not found
672.596: [GC [PSYoungGen: 6642785K->359202K(6059072K)] 
13459952K->8065935K(22836288K), 2.8260220 secs] [Times: user=2.83 
sys=33.72, real=2.83 secs]
14/07/14 18:11:41 INFO network.ConnectionManager: Removing 
ReceivingConnection to ConnectionManagerId(dc03.mydomain.org,43278)
14/07/14 18:11:41 INFO network.ConnectionManager: Removing 
SendingConnection to ConnectionManagerId(dc03.mydomain.org,43278)
14/07/14 18:11:41 INFO network.ConnectionManager: Removing 
SendingConnection to ConnectionManagerId(dc02.mydomain.org,54538)
14/07/14 18:11:41 INFO network.ConnectionManager: Removing 
ReceivingConnection to ConnectionManagerId(dc18.mydomain.org,58100)
14/07/14 18:11:41 INFO network.ConnectionManager: Removing 
SendingConnection to ConnectionManagerId(dc18.mydomain.org,58100)
14/07/14 18:11:41 INFO network.ConnectionManager: Removing 
SendingConnection to ConnectionManagerId(dc18.mydomain.org,58100)

The full log is uploaded on
https://dl.dropboxusercontent.com/u/13123103/driver.log
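Note the [Times: ...] figures in the GC lines above: sys time (68.43s) dwarfs real time (5.22s), which usually points to OS-level trouble during the pause (paging, or GC threads contending for CPU) rather than plain heap pressure. A small Python sketch for pulling those timings out of such log lines (line format assumed from the log above):

```python
import re

# Parse '[Times: user=U sys=S, real=R secs]' from ParallelScavenge GC
# log lines like the ones above, and flag entries where sys >> real
# (a common sign of paging or CPU contention during the pause).
GC_TIMES = re.compile(
    r"\[Times: user=(?P<user>[\d.]+) sys=(?P<sys>[\d.]+), real=(?P<real>[\d.]+) secs\]"
)

def parse_gc_times(line):
    m = GC_TIMES.search(line)
    if not m:
        return None
    return {k: float(v) for k, v in m.groupdict().items()}

line = ("666.108: [GC [PSYoungGen: 6540914K->975362K(7046784K)] "
        "12419091K->7792529K(23824000K), 5.2157830 secs] "
        "[Times: user=0.00 sys=68.43, real=5.22 secs]")
t = parse_gc_times(line)
assert t == {"user": 0.00, "sys": 68.43, "real": 5.22}
assert t["sys"] > 10 * t["real"]  # sys time dominates: suspect paging/contention
```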

--------------------------------------------
[The error log of a worker]
14/07/14 18:11:38 INFO worker.Worker: Executor app-20140714180032-0010/8 
finished with state EXITED message Command exited with code 1 exitStatus 1
14/07/14 18:11:38 INFO actor.LocalActorRef: Message 
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] 
from Actor[akka://sparkWorker/deadLetters] to 
Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.0.1.9%3A60601-39#1322474303]
was not delivered. [13] dead letters encountered. This logging can be 
turned off or adjusted with configuration settings 
'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/07/14 18:11:38 ERROR remote.EndpointWriter: AssociationError 
[akka.tcp://sparkWorker@dc09.mydomain.org:39578] -> 
[akka.tcp://sparkExecutor@dc09.mydomain.org:33886]: Error [Association 
failed with [akka.tcp://sparkExecutor@dc09.mydomain.org:33886]] [
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://sparkExecutor@dc09.mydomain.org:33886]
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: dc09.mydomain.org/10.0.1.9:33886]
14/07/14 18:11:38 INFO worker.Worker: Asked to launch executor 
app-20140714180032-0010/32 for Spark shell
14/07/14 18:11:38 WARN worker.CommandUtils: SPARK_JAVA_OPTS was set on 
the worker. It is deprecated in Spark 1.0.
14/07/14 18:11:38 WARN worker.CommandUtils: Set SPARK_LOCAL_DIRS for 
node-specific storage locations.
14/07/14 18:11:38 ERROR remote.EndpointWriter: AssociationError 
[akka.tcp://sparkWorker@dc09.mydomain.org:39578] -> 
[akka.tcp://sparkExecutor@dc09.mydomain.org:33886]: Error [Association 
failed with [akka.tcp://sparkExecutor@dc09.mydomain.org:33886]] [
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://sparkExecutor@dc09.mydomain.org:33886]
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: dc09.mydomain.org/10.0.1.9:33886]
