spark-user mailing list archives

From yukang chen <cykhad...@gmail.com>
Subject Re: worker keeps getting disassociated upon a failed job spark version 0.90
Date Mon, 17 Mar 2014 07:08:42 GMT
I have hit the same problem on Spark 0.9. The master lost all of its workers
because the workers' heartbeats timed out. The master then logged "Registering
worker 10.2.6.134:56158 with 24 cores, 32.0 GB RAM", but it did not add the
restarted worker's ID to its worker set.
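
In case it helps others check whether they are hitting the same thing, here is a minimal sketch that counts the two symptoms in a master log: heartbeats from a worker ID the master does not know, and re-registration attempts at an address that is already taken. The message texts are copied from the master log quoted below. The function name `scan_master_log` is made up for illustration, and the sketch assumes each log entry sits on a single line, as it would in the raw log file rather than in this wrapped email.

```python
import re
from collections import Counter

# Message texts taken from the quoted master log. Assumption: one log entry
# per line (the raw log file, not the line-wrapped copy in this email).
UNREGISTERED = re.compile(r"Got heartbeat from unregistered worker\s+(\S+)")
REREGISTER = re.compile(r"Attempted to re-register worker at same address:\s+(\S+)")


def scan_master_log(lines):
    """Return two Counters: heartbeats per unregistered worker ID, and
    re-registration attempts per worker address."""
    heartbeats = Counter()
    reregistrations = Counter()
    for line in lines:
        match = UNREGISTERED.search(line)
        if match:
            heartbeats[match.group(1)] += 1
            continue
        match = REREGISTER.search(line)
        if match:
            reregistrations[match.group(1)] += 1
    return heartbeats, reregistrations
```

To use it, open the master log file and pass the file object (or any iterable of lines) to `scan_master_log`. A worker ID that keeps appearing in the first counter after a re-registration attempt at its address would be consistent with the behaviour described above, though it does not prove the cause on its own.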


On Thu, Feb 27, 2014 at 8:14 AM, Shirish <shirish.kumar@gmail.com> wrote:

> I am a newbie! I am running Spark 0.9.0 in standalone mode on my Mac. The
> master and worker run on the same machine. Both of them start up fine (at
> least that is what I see in the logs).
>
> *Upon start-up master log is:*
>
> 14/02/26 15:38:08 INFO Slf4jLogger: Slf4jLogger started
> 14/02/26 15:38:08 INFO Remoting: Starting remoting
> 14/02/26 15:38:08 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sparkMaster@Shirishs-MacBook-Pro.local:7077]
> 14/02/26 15:38:08 INFO Master: Starting Spark master at
> spark://Shirishs-MacBook-Pro.local:7077
> 14/02/26 15:38:08 INFO MasterWebUI: Started Master web UI at
> http://192.168.1.106:8080
> 14/02/26 15:38:08 INFO Master: I have been elected leader! New state: ALIVE
> 14/02/26 15:38:22 INFO Master: Registering worker
> Shirishs-MacBook-Pro.local:56830 with 4 cores, 15.0 GB RAM
>
> *and the worker log is:*
>
> 14/02/26 15:38:21 INFO Slf4jLogger: Slf4jLogger started
> 14/02/26 15:38:21 INFO Remoting: Starting remoting
> 14/02/26 15:38:21 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sparkWorker@192.168.1.106:56830]
> 14/02/26 15:38:21 INFO Worker: Starting Spark worker 192.168.1.106:56830
> with 4 cores, 15.0 GB RAM
> 14/02/26 15:38:21 INFO Worker: Spark home:
> /Users/shirish_kumar/Developer/spark-0.9.0-incubating
> 14/02/26 15:38:22 INFO WorkerWebUI: Started Worker web UI at
> http://192.168.1.106:8081
> 14/02/26 15:38:22 INFO Worker: Connecting to master
> spark://Shirishs-MacBook-Pro.local:7077...
> 14/02/26 15:38:22 INFO Worker: Successfully registered with master
> spark://Shirishs-MacBook-Pro.local:7077
>
> When I launch my job using:
>
> ./bin/spark-class org.apache.spark.deploy.Client launch
> spark://Shirishs-MacBook-Pro.local:7077
>
> file:///Users/shirish_kumar/Developer/spark_app/SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar
> SimpleApp
>
> *Here is what I see in the master log:*
>
> 14/02/26 15:38:36 INFO Master: Driver submitted
> org.apache.spark.deploy.worker.DriverWrapper
> 14/02/26 15:38:36 INFO Master: Launching driver driver-20140226153836-0000
> on worker worker-20140226153821-192.168.1.106-56830
> 14/02/26 15:38:39 INFO Master: Registering worker
> Shirishs-MacBook-Pro.local:56830 with 4 cores, 15.0 GB RAM
> 14/02/26 15:38:39 INFO Master: Attempted to re-register worker at same
> address: akka.tcp://sparkWorker@192.168.1.106:56830
> 14/02/26 15:38:39 WARN Master: Got heartbeat from unregistered worker
> worker-20140226153839-192.168.1.106-56830
> 14/02/26 15:38:42 INFO Master: akka.tcp://driverClient@192.168.1.106:56834
> got disassociated, removing it.
> 14/02/26 15:38:42 INFO Master: akka.tcp://driverClient@192.168.1.106:56834
> got disassociated, removing it.
> 14/02/26 15:38:42 INFO LocalActorRef: Message
> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
> Actor[akka://sparkMaster/deadLetters] to
>
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40192.168.1.106%3A56835-2#330912359]
> was not delivered. [1] dead letters encountered. This logging can be turned
> off or adjusted with configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> 14/02/26 15:38:42 ERROR EndpointWriter: AssociationError
> [akka.tcp://sparkMaster@Shirishs-MacBook-Pro.local:7077] ->
> [akka.tcp://driverClient@192.168.1.106:56834]: Error [Association failed
> with [akka.tcp://driverClient@192.168.1.106:56834]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://driverClient@192.168.1.106:56834]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: /192.168.1.106:56834
> ]
> 14/02/26 15:38:42 INFO Master: akka.tcp://driverClient@192.168.1.106:56834
> got disassociated, removing it.
> 14/02/26 15:38:42 INFO Master: akka.tcp://driverClient@192.168.1.106:56834
> got disassociated, removing it.
> 14/02/26 15:38:42 ERROR EndpointWriter: AssociationError
> [akka.tcp://sparkMaster@Shirishs-MacBook-Pro.local:7077] ->
> [akka.tcp://driverClient@192.168.1.106:56834]: Error [Association failed
> with [akka.tcp://driverClient@192.168.1.106:56834]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://driverClient@192.168.1.106:56834]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: /192.168.1.106:56834
> ]
> 14/02/26 15:38:42 ERROR EndpointWriter: AssociationError
> [akka.tcp://sparkMaster@Shirishs-MacBook-Pro.local:7077] ->
> [akka.tcp://driverClient@192.168.1.106:56834]: Error [Association failed
> with [akka.tcp://driverClient@192.168.1.106:56834]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://driverClient@192.168.1.106:56834]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: /192.168.1.106:56834
> ]
> 14/02/26 15:38:42 INFO Master: akka.tcp://driverClient@192.168.1.106:56834
> got disassociated, removing it.
> 14/02/26 15:40:52 WARN Master: Got heartbeat from unregistered worker
> worker-20140226153839-192.168.1.106-56830
> 14/02/26 15:41:09 WARN Master: Got heartbeat from unregistered worker
> worker-20140226153839-192.168.1.106-56830
>
> *The worker log is:*
>
> 14/02/26 15:38:36 INFO Worker: Asked to launch driver
> driver-20140226153836-0000
> 2014-02-26 15:38:36.790 java[14619:3c0b] Unable to load realm info from
> SCDynamicStore
> 14/02/26 15:38:36 INFO DriverRunner: Copying user jar
>
> file:/Users/shirish_kumar/Developer/spark_app/SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar
> to
>
> /Users/shirish_kumar/Developer/spark-0.9.0-incubating/work/driver-20140226153836-0000/simple-project_2.10-1.0.jar
> 14/02/26 15:38:37 INFO DriverRunner: Launch Command:
> "/Library/Java/JavaVirtualMachines/jdk1.7.0_40.jdk/Contents/Home/bin/java"
> "-cp"
>
> ":/Users/shirish_kumar/Developer/spark-0.9.0-incubating/work/driver-20140226153836-0000/simple-project_2.10-1.0.jar:/Users/shirish_kumar/Developer/spark-0.9.0-incubating/conf:/Users/shirish_kumar/Developer/spark-0.9.0-incubating/assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop1.0.4.jar"
> "-Dspark.worker.timeout=600" "-Dspark.akka.timeout=200"
> "-Dspark.worker.timeout=600" "-Dspark.akka.timeout=200" "-Xms512M"
> "-Xmx512M" "org.apache.spark.deploy.worker.DriverWrapper"
> "akka.tcp://sparkWorker@192.168.1.106:56830/user/Worker" "SimpleApp"
> 14/02/26 15:38:39 ERROR OneForOneStrategy: FAILED (of class
> scala.Enumeration$Val)
> scala.MatchError: FAILED (of class scala.Enumeration$Val)
>         at
>
> org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:277)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>         at
>
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>         at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at
>
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at
>
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 14/02/26 15:38:39 INFO LocalActorRef: Message
> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
> Actor[akka://sparkWorker/deadLetters] to
>
> Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40192.168.1.106%3A56838-2#531095069]
> was not delivered. [1] dead letters encountered. This logging can be turned
> off or adjusted with configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> 14/02/26 15:38:39 ERROR EndpointWriter: AssociationError
> [akka.tcp://sparkWorker@192.168.1.106:56830] ->
> [akka.tcp://Driver@192.168.1.106:56836]: Error [Association failed with
> [akka.tcp://Driver@192.168.1.106:56836]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://Driver@192.168.1.106:56836]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: /192.168.1.106:56836
> ]
> 14/02/26 15:38:39 ERROR EndpointWriter: AssociationError
> [akka.tcp://sparkWorker@192.168.1.106:56830] ->
> [akka.tcp://Driver@192.168.1.106:56836]: Error [Association failed with
> [akka.tcp://Driver@192.168.1.106:56836]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://Driver@192.168.1.106:56836]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: /192.168.1.106:56836
> ]
> 14/02/26 15:38:39 ERROR EndpointWriter: AssociationError
> [akka.tcp://sparkWorker@192.168.1.106:56830] ->
> [akka.tcp://Driver@192.168.1.106:56836]: Error [Association failed with
> [akka.tcp://Driver@192.168.1.106:56836]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://Driver@192.168.1.106:56836]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: /192.168.1.106:56836
> ]
> 14/02/26 15:38:39 INFO Worker: Starting Spark worker 192.168.1.106:56830
> with 4 cores, 15.0 GB RAM
> 14/02/26 15:38:39 INFO Worker: Spark home:
> /Users/shirish_kumar/Developer/spark-0.9.0-incubating
> 14/02/26 15:38:39 INFO WorkerWebUI: Started Worker web UI at
> http://192.168.1.106:8081
> 14/02/26 15:38:39 INFO Worker: Connecting to master
> spark://Shirishs-MacBook-Pro.local:7077...
> 14/02/26 15:38:39 INFO Worker: Successfully registered with master
> spark://Shirishs-MacBook-Pro.local:7077
>
>
> The web UI (port 8080) shows the worker as dead, the "new" worker never
> gets registered, and I can no longer submit any jobs.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/worker-keeps-getting-disassociated-upon-a-failed-job-spark-version-0-90-tp2099.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
