We are having the same problem. We're running Spark 0.9.1 in standalone mode, and on some heavy jobs workers become unresponsive and are marked as dead by the master, even though the worker process is still running. They then never rejoin the cluster, and the cluster becomes essentially unusable until we restart each worker.

We'd like to know:
1. Why can a worker become unresponsive? Are there any well-known config or usage pitfalls that we could have fallen into? We're still investigating the issue, but maybe there are some hints?
2. Is there an option to auto-recover a worker, e.g. to automatically start a new one if the old one failed? Or at least some hooks to implement functionality like that?
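
For question 2, this is roughly the kind of external watchdog we have in mind if nothing is built in. It is only a sketch in Python, and it rests on assumptions we have not verified on our cluster: that the master web UI serves a JSON status page at /json listing each worker with "host" and "state" fields, and that a worker can be started through spark-class as described in the standalone docs. The function names and the default timeout are made up for illustration.

```python
import json
import os
import urllib.request


def fetch_master_status(master_webui_url, timeout_s=5):
    """Fetch the master's JSON status page.

    ASSUMPTION: the master web UI exposes its state at <webui>/json.
    """
    with urllib.request.urlopen(master_webui_url.rstrip("/") + "/json",
                                timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))


def worker_needs_restart(master_status, local_host):
    """Return True if the master reports no ALIVE worker on local_host.

    ASSUMPTION: master_status has a "workers" list whose entries carry
    "host" and "state" keys, with "ALIVE" marking a healthy worker.
    """
    workers = master_status.get("workers", [])
    alive_here = [w for w in workers
                  if w.get("host") == local_host and w.get("state") == "ALIVE"]
    return len(alive_here) == 0


def worker_start_command(spark_home, master_url):
    """Build the command line that starts a standalone worker by hand."""
    return [os.path.join(spark_home, "bin", "spark-class"),
            "org.apache.spark.deploy.worker.Worker",
            master_url]
```

A cron job or supervisor on each node could call worker_needs_restart periodically and, when it returns True, kill the stale worker process and run the command from worker_start_command. Stopping the old process is deliberately left out, since how to find it depends on how the worker was launched.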


2014-06-13 22:58 GMT+02:00 Gino Bustelo <gino@bustelos.com>:
I get the same problem, but I'm running in a dev environment based on Docker scripts. The additional issue is that the worker processes do not die, so the Docker container does not exit. I end up with worker containers that are not participating in the cluster.

On Fri, Jun 13, 2014 at 9:44 AM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
I have also had trouble with workers rejoining the working set. I have typically moved to a Mesos-based setup. Frankly, for high availability you are better off using a cluster manager.

On Fri, Jun 13, 2014 at 8:57 AM, Yana Kadiyska <yana.kadiyska@gmail.com> wrote:
Hi, I see this has been asked before but has not gotten any satisfactory answer so I'll try again:

I have a set of workers dying and coming back again. The master prints the following warning:

"Got heartbeat from unregistered worker ...."

What is the solution to this -- rolling the master is very undesirable to me as I have a Shark context sitting on top of it (it's meant to be highly available).

Insights appreciated -- I don't think an executor going down is very unexpected but it does seem odd that it won't be able to rejoin the working set.

I'm running Spark 0.9.1 on CDH

Piotr Kolaczkowski, Lead Software Engineer
