We're using Spark 0.9.0 "out of the box" -- no Cloudera Manager or anything similar.

The master logs warnings that it continues to receive heartbeats from unregistered workers. I will check whether there are any telltale errors on the worker side.

We've had occasional problems with running out of memory on the driver side (esp. with large broadcast variables) so that may be related.  


On Tuesday, May 20, 2014, Matei Zaharia <matei.zaharia@gmail.com> wrote:
Are you guys both using Cloudera Manager? Maybe there’s also an issue with the integration with that.


On May 20, 2014, at 11:44 AM, Aaron Davidson <ilikerps@gmail.com> wrote:

I'd just like to point out that, along with Matei, I have not seen workers drop even under the most exotic job failures. We're running pretty close to master, though; perhaps it is related to an uncaught exception in the Worker from a prior version of Spark.

On Tue, May 20, 2014 at 11:36 AM, Arun Ahuja <aahuja11@gmail.com> wrote:
Hi Matei,

Unfortunately, I don't have more detailed information, but we have seen the loss of workers in standalone mode as well. If a job is killed through CTRL-C, we will often see the number of workers and cores decrease on the Spark Master page. They are still alive and well in the Cloudera Manager page, but not visible on the Spark master. Simply restarting the workers usually resolves this, but we often see workers disappear after a failed or killed job.

If we see this occur again, I'll try to provide some logs.

On Mon, May 19, 2014 at 10:51 PM, Matei Zaharia <matei.zaharia@gmail.com> wrote:
Which version is this with? I haven’t seen standalone masters lose workers. Is there other stuff on the machines that’s killing them, or what errors do you see?


On May 16, 2014, at 9:53 AM, Josh Marcus <jmarcus@meetup.com> wrote:

> Hey folks,
> I'm wondering what strategies other folks are using for maintaining and monitoring the stability of stand-alone spark clusters.
> Our master very regularly loses workers, and they (as expected) never rejoin the cluster.  This is the same behavior I've seen
> using akka cluster (if that's what spark is using in stand-alone mode) -- are there configuration options we could be setting
> to make the cluster more robust?
> We have a custom script which monitors the number of workers (through the web interface) and restarts the cluster when
> necessary, as well as resolving other issues we face (like spark shells left open permanently claiming resources), and it
> works, but it's nowhere close to a great solution.
> What are other folks doing?  Is this something that other folks observe as well?  I suspect that the loss of workers is tied to
> jobs that run out of memory on the client side or our use of very large broadcast variables, but I don't have an isolated test case.
> I'm open to general answers here: for example, perhaps we should simply be using Mesos or YARN instead of stand-alone mode.
> --j
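
For anyone who wants to try the kind of worker-count check Josh describes, here is a minimal sketch in Python. It is not the script from the thread. It assumes the master's web interface can return its status as JSON with a "workers" list whose entries carry a "state" field; that layout, the URL in the usage note, and all function names are assumptions for illustration, so check them against your own master before relying on this.

```python
import json
from urllib.request import urlopen


def count_alive_workers(master_state):
    """Count the workers reported as ALIVE in a parsed master status document.

    Assumes a layout like {"workers": [{"state": "ALIVE"}, ...]}; this is an
    assumption, not something confirmed in the thread.
    """
    workers = master_state.get("workers", [])
    return sum(1 for w in workers if w.get("state") == "ALIVE")


def needs_restart(master_state, expected_workers):
    """Return True when fewer workers are alive than the cluster should have."""
    return count_alive_workers(master_state) < expected_workers


def fetch_master_state(url, timeout=10):
    """Download and parse the master's JSON status page (the URL is supplied by the caller)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A cron job could call fetch_master_state("http://spark-master.example:8080/json") (a hypothetical host and path), pass the result to needs_restart along with the number of workers you expect, and restart the missing workers when it returns True. Keeping the counting logic separate from the HTTP call makes it easy to test against a saved status document.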