I'm seeing the same problem, but in a dev environment based on Docker scripts. An additional issue is that the worker processes do not die, so the Docker container does not exit. I end up with worker containers that are not participating in the cluster.

On Fri, Jun 13, 2014 at 9:44 AM, Mayur Rustagi <mayur.rustagi@gmail.com> wrote:
I have also had trouble with workers joining the working set. I have typically moved to a Mesos-based setup. Frankly, for high availability you are better off using a cluster manager.

On Fri, Jun 13, 2014 at 8:57 AM, Yana Kadiyska <yana.kadiyska@gmail.com> wrote:
Hi, I see this has been asked before but has not gotten a satisfactory answer, so I'll try again:

I have a set of workers dying and coming back again. The master prints the following warning:

"Got heartbeat from unregistered worker ...."

What is the solution to this? Rolling the master is very undesirable to me, as I have a Shark context sitting on top of it (it's meant to be highly available).

Insights appreciated. I don't think an executor going down is very unexpected, but it does seem odd that it isn't able to rejoin the working set.

I'm running Spark 0.9.1 on CDH.