Hi Matei,

Unfortunately, I don't have more detailed information, but we have seen workers lost in standalone mode as well.  If a job is killed with CTRL-C, we often see the number of workers and cores on the Spark Master page decrease.  The workers are still alive and well according to the Cloudera Manager page, but they are no longer visible on the Spark master.  Simply restarting the workers usually resolves this, but we often see workers disappear after a failed or killed job.

If we see this occur again, I'll try to provide some logs.
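
For anyone who wants to catch this automatically, here is a rough sketch of the kind of check discussed in this thread: compare the number of workers the master reports as alive against the number you expect. It is only an illustration, not what either of us actually runs. It assumes the standalone master's web UI serves its status as JSON at "/json", with a "workers" list whose entries have a "state" field set to "ALIVE" for healthy workers; please verify that against your Spark version. The function names and the sample data are made up for the example.

```python
import json
from urllib.request import urlopen


def count_alive_workers(status):
    """Count workers the master reports as ALIVE.

    Assumption: `status` is the parsed JSON status, containing a "workers"
    list whose entries each carry a "state" field.
    """
    return sum(1 for w in status.get("workers", []) if w.get("state") == "ALIVE")


def workers_missing(status, expected):
    """Return how many of the `expected` workers are not ALIVE on the master."""
    return max(0, expected - count_alive_workers(status))


def fetch_master_status(master_url, timeout=10):
    """Fetch and parse the master's JSON status.

    Assumption: the status is served at <master_url>/json.
    """
    with urlopen(master_url.rstrip("/") + "/json", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical status payload, shaped the way the sketch assumes.
sample_status = {
    "workers": [
        {"id": "worker-1", "state": "ALIVE"},
        {"id": "worker-2", "state": "DEAD"},
    ]
}
missing = workers_missing(sample_status, expected=3)
```

In a cron job you would call fetch_master_status() with your master's web UI address and alert (or restart the workers) when workers_missing() returns a non-zero value. Keeping the counting separate from the HTTP call makes the logic easy to test without a running cluster.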




On Mon, May 19, 2014 at 10:51 PM, Matei Zaharia <matei.zaharia@gmail.com> wrote:
Which version is this with? I haven’t seen standalone masters lose workers. Is there other stuff on the machines that’s killing them, or what errors do you see?

Matei

On May 16, 2014, at 9:53 AM, Josh Marcus <jmarcus@meetup.com> wrote:

> Hey folks,
>
> I'm wondering what strategies other folks are using for maintaining and monitoring the stability of stand-alone spark clusters.
>
> Our master very regularly loses workers, and they (as expected) never rejoin the cluster.  This is the same behavior I've seen
> using akka cluster (if that's what spark is using in stand-alone mode) -- are there configuration options we could be setting
> to make the cluster more robust?
>
> We have a custom script which monitors the number of workers (through the web interface) and restarts the cluster when
> necessary, as well as resolving other issues we face (like spark shells left open permanently claiming resources), and it
> works, but it's nowhere close to a great solution.
>
> What are other folks doing?  Is this something that other folks observe as well?  I suspect that the loss of workers is tied to
> jobs that run out of memory on the client side or our use of very large broadcast variables, but I don't have an isolated test case.
> I'm open to general answers here: for example, perhaps we should simply be using mesos or yarn instead of stand-alone mode.
>
> --j
>