spark-user mailing list archives

From Nan Zhu <zhunanmcg...@gmail.com>
Subject Re: master attempted to re-register the worker and then took all workers as unregistered
Date Wed, 15 Jan 2014 13:02:37 GMT
I found the reason for the weird behaviour.

The executor throws an exception when starting, due to a bug in my application code (I forgot
to set, on every machine, an env variable that the application code uses).
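
A fail-fast check at application startup would make this kind of mistake obvious. The following is only a minimal sketch in Python for illustration; the helper name `require_env` and the variable name `ALS_DATA_DIR` are hypothetical and do not come from the actual application.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable, or fail with a clear message.

    `environ` defaults to the real process environment; passing a dict
    makes the check easy to try out without touching the machine's settings.
    """
    env = os.environ if environ is None else environ
    value = env.get(name)
    if not value:
        # Failing here, by name, is easier to diagnose than an exception
        # raised later from deep inside the executor.
        raise RuntimeError(f"required environment variable {name!r} is not set")
    return value


# Hypothetical usage; ALS_DATA_DIR is an invented name.
data_dir = require_env("ALS_DATA_DIR", {"ALS_DATA_DIR": "/tmp/als"})
```

Calling such a check first thing in the application means a machine with the variable missing reports the variable's name instead of an unrelated stack trace.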

Then the master seems to remove the worker from its list (?), while the worker keeps sending
heartbeats and gets no reply; eventually all workers are dead.
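
To make the suspected sequence concrete, here is a toy model in Python. It is a guess at the bookkeeping, not Spark's actual code: the class `ToyMaster`, its two dictionaries, and the rule that a second registration from a known address is refused are all assumptions, chosen only because they reproduce the two messages seen in the quoted master log.

```python
class ToyMaster:
    """Toy model of a master's worker bookkeeping (not Spark's actual code)."""

    def __init__(self):
        self.id_to_address = {}  # registered worker ID -> "host:port"
        self.address_to_id = {}  # "host:port" -> registered worker ID
        self.log = []

    def register(self, worker_id, address):
        # Assumption: a registration from an already-known address is refused,
        # so the new worker ID is never recorded.
        if address in self.address_to_id:
            self.log.append(
                f"INFO Attempted to re-register worker at same address: {address}"
            )
            return False
        self.id_to_address[worker_id] = address
        self.address_to_id[address] = worker_id
        return True

    def heartbeat(self, worker_id):
        # Heartbeats are looked up by worker ID, not by address.
        if worker_id not in self.id_to_address:
            self.log.append(f"WARN Got heartbeat from unregistered worker {worker_id}")
            return False
        return True


master = ToyMaster()
address = "ip-172-31-40-28.us-west-2.compute.internal:43055"
old_id = "worker-20140115013734-ip-172-31-40-28.us-west-2.compute.internal-43055"
new_id = "worker-20140115013844-ip-172-31-40-28.us-west-2.compute.internal-43055"

master.register(old_id, address)  # initial registration succeeds
master.register(new_id, address)  # refused: the address is already known
master.heartbeat(new_id)          # the new ID was never recorded
print("\n".join(master.log))
```

Under these assumptions a worker that comes back with a new ID at the same address can never be recognised again, which would match heartbeats that get no useful reply.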

But obviously it should not work this way: problematic application code should not be able
to make all workers dead.

I’m checking the source code to find the reason
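
One detail in the quoted log may help here: the executors were launched on worker IDs stamped 20140115013734, while the "unregistered" heartbeats carry IDs stamped 20140115013844 for the same hosts and ports. That suggests, though it does not prove, that each worker came back with a fresh ID. A small Python sketch for spotting this in a master log follows; the function name `ids_by_address` is made up for illustration, and the sample lines are copied from the log below.

```python
import re

# Worker IDs in the log look like worker-<14-digit timestamp>-<host>-<port>.
WORKER_ID = re.compile(r"worker-(\d{14})-(\S+)-(\d+)(?=\s|$)")


def ids_by_address(log_text):
    """Map "host:port" to the set of worker-ID timestamps seen for it."""
    seen = {}
    for stamp, host, port in WORKER_ID.findall(log_text):
        seen.setdefault(f"{host}:{port}", set()).add(stamp)
    return seen


sample = """\
14/01/15 01:38:44 INFO Master: Launching executor app-20140115013844-0000/0 on worker worker-20140115013734-ip-172-31-43-78.us-west-2.compute.internal-36257
14/01/15 01:38:44 WARN Master: Got heartbeat from unregistered worker worker-20140115013844-ip-172-31-43-78.us-west-2.compute.internal-36257
"""

for address, stamps in ids_by_address(sample).items():
    if len(stamps) > 1:
        print(address, "appears under several worker IDs:", sorted(stamps))
```

Any address that shows up under more than one timestamp is a worker whose ID changed during the run.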

Best,

--  
Nan Zhu


On Tuesday, January 14, 2014 at 8:53 PM, Nan Zhu wrote:

> Hi, all  
>  
> I’m trying to deploy Spark in standalone mode, and everything goes as usual:
>  
> the web UI is accessible, and the master node wrote some logs saying all workers are registered
>  
> 14/01/15 01:37:30 INFO Slf4jEventHandler: Slf4jEventHandler started
> 14/01/15 01:37:31 INFO ActorSystemImpl: RemoteServerStarted@akka://sparkMaster@172.31.36.93:7077
> 14/01/15 01:37:31 INFO Master: Starting Spark master at spark://172.31.36.93:7077
> 14/01/15 01:37:31 INFO MasterWebUI: Started Master web UI at http://ip-172-31-36-93.us-west-2.compute.internal:8080
> 14/01/15 01:37:31 INFO Master: I have been elected leader! New state: ALIVE
> 14/01/15 01:37:34 INFO ActorSystemImpl: RemoteClientStarted@akka://sparkWorker@ip-172-31-34-61.us-west-2.compute.internal:37914
> 14/01/15 01:37:34 INFO ActorSystemImpl: RemoteClientStarted@akka://sparkWorker@ip-172-31-40-28.us-west-2.compute.internal:43055
> 14/01/15 01:37:34 INFO Master: Registering worker ip-172-31-34-61.us-west-2.compute.internal:37914 with 2 cores, 6.3 GB RAM
> 14/01/15 01:37:34 INFO ActorSystemImpl: RemoteClientStarted@akka://sparkWorker@ip-172-31-45-211.us-west-2.compute.internal:55355
> 14/01/15 01:37:34 INFO Master: Registering worker ip-172-31-40-28.us-west-2.compute.internal:43055 with 2 cores, 6.3 GB RAM
> 14/01/15 01:37:34 INFO Master: Registering worker ip-172-31-45-211.us-west-2.compute.internal:55355 with 2 cores, 6.3 GB RAM
> 14/01/15 01:37:34 INFO ActorSystemImpl: RemoteClientStarted@akka://sparkWorker@ip-172-31-41-251.us-west-2.compute.internal:47709
> 14/01/15 01:37:34 INFO Master: Registering worker ip-172-31-41-251.us-west-2.compute.internal:47709 with 2 cores, 6.3 GB RAM
> 14/01/15 01:37:34 INFO ActorSystemImpl: RemoteClientStarted@akka://sparkWorker@ip-172-31-43-78.us-west-2.compute.internal:36257
> 14/01/15 01:37:34 INFO Master: Registering worker ip-172-31-43-78.us-west-2.compute.internal:36257 with 2 cores, 6.3 GB RAM
> 14/01/15 01:38:44 INFO ActorSystemImpl: RemoteClientStarted@akka://spark@ip-172-31-37-160.us-west-2.compute.internal:43086
>  
>  
>  
>  
> However, when I launched an application, the master first “attempted to re-register the worker” and then said that all heartbeats were from “unregistered” workers. Can anyone tell me what happened here?
>  
> 14/01/15 01:38:44 INFO Master: Registering app ALS  
> 14/01/15 01:38:44 INFO Master: Registered app ALS with ID app-20140115013844-0000
> 14/01/15 01:38:44 INFO Master: Launching executor app-20140115013844-0000/0 on worker worker-20140115013734-ip-172-31-43-78.us-west-2.compute.internal-36257
> 14/01/15 01:38:44 INFO Master: Launching executor app-20140115013844-0000/1 on worker worker-20140115013734-ip-172-31-40-28.us-west-2.compute.internal-43055
> 14/01/15 01:38:44 INFO Master: Launching executor app-20140115013844-0000/2 on worker worker-20140115013734-ip-172-31-34-61.us-west-2.compute.internal-37914
> 14/01/15 01:38:44 INFO Master: Launching executor app-20140115013844-0000/3 on worker worker-20140115013734-ip-172-31-45-211.us-west-2.compute.internal-55355
> 14/01/15 01:38:44 INFO Master: Launching executor app-20140115013844-0000/4 on worker worker-20140115013734-ip-172-31-41-251.us-west-2.compute.internal-47709
> 14/01/15 01:38:44 INFO Master: Registering worker ip-172-31-40-28.us-west-2.compute.internal:43055 with 2 cores, 6.3 GB RAM
> 14/01/15 01:38:44 INFO Master: Attempted to re-register worker at same address: akka://sparkWorker@ip-172-31-40-28.us-west-2.compute.internal:43055
> 14/01/15 01:38:44 INFO Master: Registering worker ip-172-31-34-61.us-west-2.compute.internal:37914 with 2 cores, 6.3 GB RAM
> 14/01/15 01:38:44 INFO Master: Attempted to re-register worker at same address: akka://sparkWorker@ip-172-31-34-61.us-west-2.compute.internal:37914
> 14/01/15 01:38:44 INFO Master: Registering worker ip-172-31-41-251.us-west-2.compute.internal:47709 with 2 cores, 6.3 GB RAM
> 14/01/15 01:38:44 INFO Master: Attempted to re-register worker at same address: akka://sparkWorker@ip-172-31-41-251.us-west-2.compute.internal:47709
> 14/01/15 01:38:44 INFO Master: Registering worker ip-172-31-45-211.us-west-2.compute.internal:55355 with 2 cores, 6.3 GB RAM
> 14/01/15 01:38:44 INFO Master: Attempted to re-register worker at same address: akka://sparkWorker@ip-172-31-45-211.us-west-2.compute.internal:55355
> 14/01/15 01:38:44 INFO Master: Registering worker ip-172-31-43-78.us-west-2.compute.internal:36257 with 2 cores, 6.3 GB RAM
> 14/01/15 01:38:44 INFO Master: Attempted to re-register worker at same address: akka://sparkWorker@ip-172-31-43-78.us-west-2.compute.internal:36257
> 14/01/15 01:38:44 WARN Master: Got heartbeat from unregistered worker worker-20140115013844-ip-172-31-34-61.us-west-2.compute.internal-37914
> 14/01/15 01:38:44 WARN Master: Got heartbeat from unregistered worker worker-20140115013844-ip-172-31-45-211.us-west-2.compute.internal-55355
> 14/01/15 01:38:44 WARN Master: Got heartbeat from unregistered worker worker-20140115013844-ip-172-31-40-28.us-west-2.compute.internal-43055
> 14/01/15 01:38:44 WARN Master: Got heartbeat from unregistered worker worker-20140115013844-ip-172-31-43-78.us-west-2.compute.internal-36257
> 14/01/15 01:38:44 WARN Master: Got heartbeat from unregistered worker worker-20140115013844-ip-172-31-41-251.us-west-2.compute.internal-47709
> 14/01/15 01:38:50 WARN Master: Got heartbeat from unregistered worker worker-20140115013844-ip-172-31-45-211.us-west-2.compute.internal-55355
>  
>  
>  
>  
> Thank you very much!
>  
> Best,
>  
> --  
> Nan Zhu
>  

