spark-user mailing list archives

From Arun Ahuja <>
Subject Re: advice on maintaining a production spark cluster?
Date Tue, 20 May 2014 18:36:30 GMT
Hi Matei,

Unfortunately, I don't have more detailed information, but we have seen the
loss of workers in standalone mode as well.  If a job is killed with
CTRL-C, we will often see the number of workers and cores on the Spark
Master page decrease.  The workers are still alive and well in the Cloudera
Manager page, but they are no longer visible on the Spark master.  Simply
restarting the workers usually resolves this, but we often see workers
disappear after a failed or killed job.

If we see this occur again, I'll try to provide some logs.
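
For anyone who wants to check the worker count programmatically rather
than by eye, here is a minimal sketch in Python.  It assumes the
standalone master's web UI serves a JSON status document at /json with a
"workers" list whose entries carry a "state" field; the host name, the
expected worker count, and the function names are all placeholders of
mine, not anything from our setup.

```python
import json
from urllib.request import urlopen

# Assumption: the standalone master web UI exposes a JSON status document
# at /json. The host below is a hypothetical placeholder.
MASTER_STATUS_URL = "http://spark-master.example.com:8080/json"


def fetch_master_status(url=MASTER_STATUS_URL, timeout=10):
    """Download and parse the master's JSON status document."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def count_alive_workers(status):
    """Count the workers the master currently reports as ALIVE."""
    return sum(1 for w in status.get("workers", []) if w.get("state") == "ALIVE")


def missing_workers(status, expected):
    """Return how many workers are missing relative to the expected count."""
    return max(0, expected - count_alive_workers(status))


if __name__ == "__main__":
    status = fetch_master_status()
    missing = missing_workers(status, expected=8)  # 8 is a placeholder
    if missing:
        print(f"{missing} worker(s) not registered with the master")
```

This only reports the discrepancy; what to do about it (restart the
workers, alert someone) is left to whatever wraps it.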

On Mon, May 19, 2014 at 10:51 PM, Matei Zaharia <> wrote:

> Which version is this with? I haven’t seen standalone masters lose
> workers. Is there other stuff on the machines that’s killing them, or what
> errors do you see?
> Matei
> On May 16, 2014, at 9:53 AM, Josh Marcus <> wrote:
> > Hey folks,
> >
> > I'm wondering what strategies other folks are using for maintaining and
> > monitoring the stability of stand-alone spark clusters.
> >
> > Our master very regularly loses workers, and they (as expected) never
> > rejoin the cluster.  This is the same behavior I've seen using akka
> > cluster (if that's what spark is using in stand-alone mode) -- are there
> > configuration options we could be setting to make the cluster more robust?
> >
> > We have a custom script which monitors the number of workers (through
> > the web interface) and restarts the cluster when necessary, as well as
> > resolving other issues we face (like spark shells left open permanently
> > claiming resources), and it works, but it's nowhere close to a great
> > solution.
> >
> > What are other folks doing?  Is this something that other folks observe
> > as well?  I suspect that the loss of workers is tied to jobs that run out
> > of memory on the client side or our use of very large broadcast
> > variables, but I don't have an isolated test case.  I'm open to general
> > answers here: for example, perhaps we should simply be using mesos or
> > yarn instead of stand-alone mode.
> >
> > --j
> >
