spark-user mailing list archives

From Aaron Davidson <ilike...@gmail.com>
Subject Re: advice on maintaining a production spark cluster?
Date Tue, 20 May 2014 18:44:11 GMT
I'd just like to point out that, like Matei, I have not seen workers drop
even under the most exotic job failures. We're running pretty close to
master, though; perhaps the issue is related to an uncaught exception in the
Worker in a prior version of Spark.


On Tue, May 20, 2014 at 11:36 AM, Arun Ahuja <aahuja11@gmail.com> wrote:

> Hi Matei,
>
> Unfortunately, I don't have more detailed information, but we have seen
> the loss of workers in standalone mode as well.  If a job is killed through
> CTRL-C, we will often see the number of workers and cores decrease on the
> Spark master page.  The workers are still alive and well in the Cloudera
> Manager page, but no longer visible to the Spark master.  Simply restarting
> the workers usually resolves this, but we often see workers disappear after
> a failed or killed job.
>
> If we see this occur again, I'll try to provide some logs.
>
> On Mon, May 19, 2014 at 10:51 PM, Matei Zaharia <matei.zaharia@gmail.com> wrote:
>
>> Which version is this with? I haven’t seen standalone masters lose
>> workers. Is there other stuff on the machines that’s killing them, or what
>> errors do you see?
>>
>> Matei
>>
>> On May 16, 2014, at 9:53 AM, Josh Marcus <jmarcus@meetup.com> wrote:
>>
>> > Hey folks,
>> >
>> > I'm wondering what strategies other folks are using for maintaining
>> > and monitoring the stability of standalone Spark clusters.
>> >
>> > Our master very regularly loses workers, and they (as expected) never
>> > rejoin the cluster.  This is the same behavior I've seen using Akka
>> > cluster (if that's what Spark is using in standalone mode) -- are there
>> > configuration options we could be setting to make the cluster more
>> > robust?
>> >
>> > We have a custom script which monitors the number of workers (through
>> > the web interface) and restarts the cluster when necessary, and which
>> > also resolves other issues we face (like Spark shells left open,
>> > permanently claiming resources).  It works, but it's nowhere close to
>> > a great solution.
>> >
>> > What are other folks doing?  Is this something that others observe as
>> > well?  I suspect that the loss of workers is tied to jobs that run out
>> > of memory on the client side or our use of very large broadcast
>> > variables, but I don't have an isolated test case.  I'm open to general
>> > answers here: for example, perhaps we should simply be using Mesos or
>> > YARN instead of standalone mode.
>> >
>> > --j
>> >
>>
>>
>
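
For anyone writing the kind of watchdog Josh describes above, here is a
minimal sketch of one approach: poll the standalone master's status page
and flag when the live worker count drops. The /json endpoint on the
master web UI (port 8080) and its "workers"/"state" field names are my
assumptions from the standalone UI; the hostname, expected worker count,
and restart action are placeholders, not a tested recipe.

    #!/usr/bin/env python
    # Hypothetical worker-count watchdog for a standalone Spark master.
    # Polls the master's JSON status page and warns when the number of
    # live workers drops below what we expect. The endpoint path and the
    # "workers"/"state" field names are assumptions, not a documented API.
    import json
    import time
    import urllib2

    MASTER_JSON = "http://spark-master:8080/json"  # placeholder host
    EXPECTED_WORKERS = 8                           # placeholder count

    def alive_workers():
        state = json.load(urllib2.urlopen(MASTER_JSON))
        # Each worker entry reports a state such as ALIVE or DEAD.
        return [w for w in state["workers"] if w.get("state") == "ALIVE"]

    while True:
        n = len(alive_workers())
        if n < EXPECTED_WORKERS:
            print "WARNING: only %d of %d workers alive" % (n, EXPECTED_WORKERS)
            # Placeholder action: page an operator, or bounce the cluster
            # with sbin/stop-all.sh && sbin/start-all.sh on the master.
        time.sleep(60)

On the configuration side, the standalone master's spark.worker.timeout
setting (how long the master waits without a heartbeat before marking a
worker lost) may also be worth tuning alongside any external monitor.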
