spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: advice on maintaining a production spark cluster?
Date Tue, 20 May 2014 19:11:18 GMT
Are you guys both using Cloudera Manager? Maybe there’s also an issue with that integration.

Matei

On May 20, 2014, at 11:44 AM, Aaron Davidson <ilikerps@gmail.com> wrote:

> I'd just like to point out that, along with Matei, I have not seen workers drop even
> under the most exotic job failures. We're running pretty close to master, though; perhaps
> it is related to an uncaught exception in the Worker from a prior version of Spark.
> 
> 
> On Tue, May 20, 2014 at 11:36 AM, Arun Ahuja <aahuja11@gmail.com> wrote:
> Hi Matei,
> 
> Unfortunately, I don't have more detailed information, but we have seen the loss of workers
> in standalone mode as well.  If a job is killed through CTRL-C, we will often see the number
> of workers and cores decrease on the Spark Master page.  The workers are still alive and well
> on the Cloudera Manager page, but not visible on the Spark master.  Simply restarting the
> workers usually resolves this, but we have often seen workers disappear after a failed or
> killed job.
> 
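For reference, bouncing a single standalone worker by hand looks roughly like this (a sketch assuming a Spark 1.0-era layout; the instance number and master URL are placeholders to adjust for your cluster):

    # On the affected worker machine: stop worker instance 1, then start it
    # again pointed at the master so it re-registers.
    $SPARK_HOME/sbin/spark-daemon.sh stop org.apache.spark.deploy.worker.Worker 1
    $SPARK_HOME/sbin/spark-daemon.sh start org.apache.spark.deploy.worker.Worker 1 \
        spark://master-host:7077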
> If we see this occur again, I'll try and provide some logs.
> 
> 
> 
> 
> On Mon, May 19, 2014 at 10:51 PM, Matei Zaharia <matei.zaharia@gmail.com> wrote:
> Which version is this with? I haven’t seen standalone masters lose workers. Is there
> other stuff on the machines that’s killing them, or what errors do you see?
> 
> Matei
> 
> On May 16, 2014, at 9:53 AM, Josh Marcus <jmarcus@meetup.com> wrote:
> 
> > Hey folks,
> >
> > I'm wondering what strategies other folks are using for maintaining and monitoring
> > the stability of stand-alone Spark clusters.
> >
> > Our master very regularly loses workers, and they (as expected) never rejoin the
> > cluster.  This is the same behavior I've seen using Akka cluster (if that's what
> > Spark is using in stand-alone mode) -- are there configuration options we could be
> > setting to make the cluster more robust?
> >
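One set of knobs worth checking is the standalone/Akka timeout configuration. A minimal sketch, assuming Spark ~0.9/1.0 property names (verify against the docs for your version); the daemons pick these up from conf/spark-env.sh:

    # Raise the master's worker-lost timeout from its 60 s default, lengthen
    # the Akka communication timeout, and tolerate longer heartbeat pauses
    # (e.g. from GC) before the failure detector declares a node dead.
    SPARK_DAEMON_JAVA_OPTS="-Dspark.worker.timeout=120 \
      -Dspark.akka.timeout=200 \
      -Dspark.akka.heartbeat.pauses=1200"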
> > We have a custom script which monitors the number of workers (through the web
> > interface) and restarts the cluster when necessary, as well as resolving other
> > issues we face (like Spark shells left open, permanently claiming resources).
> > It works, but it's nowhere close to a great solution.
> >
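For what it's worth, the polling half of such a script can stay small by reading the master's JSON view of cluster state. A minimal watchdog sketch in Python 2, assuming a 1.0-era master that serves JSON at http://<master>:8080/json with a "workers" list carrying a "state" field; the host, expected count, and restart command are all placeholders:

    #!/usr/bin/env python
    # Poll the standalone master and re-run the cluster start script when
    # the number of live workers falls below a threshold.
    import json
    import subprocess
    import urllib2

    MASTER_JSON = "http://spark-master:8080/json"      # placeholder host
    EXPECTED_WORKERS = 8                               # placeholder count
    RESTART_CMD = ["/opt/spark/sbin/start-slaves.sh"]  # placeholder path

    def alive_workers():
        state = json.load(urllib2.urlopen(MASTER_JSON, timeout=10))
        # Each worker entry reports a "state" such as ALIVE or DEAD.
        return [w for w in state.get("workers", []) if w.get("state") == "ALIVE"]

    if __name__ == "__main__":
        n = len(alive_workers())
        if n < EXPECTED_WORKERS:
            print "only %d of %d workers alive; restarting" % (n, EXPECTED_WORKERS)
            subprocess.call(RESTART_CMD)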
> > What are other folks doing?  Is this something that other folks observe as well?
> > I suspect that the loss of workers is tied to jobs that run out of memory on the
> > client side, or to our use of very large broadcast variables, but I don't have an
> > isolated test case.  I'm open to general answers here: for example, perhaps we
> > should simply be using Mesos or YARN instead of stand-alone mode.
> >
> > --j
> >
> 
> 
> 

