spark-user mailing list archives

From Josh Marcus <jmar...@meetup.com>
Subject Re: advice on maintaining a production spark cluster?
Date Tue, 20 May 2014 19:37:34 GMT
So, for example, I have two disassociated worker machines at the moment.
The last messages in the Spark logs are Akka association errors like the
following:

14/05/20 01:22:54 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@hdn3.int.meetup.com:50038] -> [akka.tcp://sparkExecutor@hdn3.int.meetup.com:46288]: Error [Association failed with [akka.tcp://sparkExecutor@hdn3.int.meetup.com:46288]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@hdn3.int.meetup.com:46288]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: hdn3.int.meetup.com/10.3.6.23:46288
]

On the master side, there are lots and lots of messages of the form:

14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker worker-20140520011737-hdn3.int.meetup.com-50038

--j
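
For reference, the 0.9 configuration docs list a few Akka failure-detector
settings (spark.akka.heartbeat.interval, spark.akka.heartbeat.pauses,
spark.akka.failure-detector.threshold) that can be tuned to make the remoting
layer more tolerant of long pauses. A minimal sketch of setting them from the
driver follows; the values are illustrative rather than tested
recommendations, and whether they help with the worker/executor
disassociation above is an open question:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative values only -- check the property names and defaults
    // against the 0.9 configuration docs before relying on them.
    val conf = new SparkConf()
      .setAppName("akka-tolerance-sketch")
      .set("spark.akka.heartbeat.interval", "1000")           // seconds between Akka heartbeats
      .set("spark.akka.heartbeat.pauses", "6000")              // acceptable heartbeat pause, in seconds
      .set("spark.akka.failure-detector.threshold", "300.0")   // accrual failure detector threshold

    val sc = new SparkContext(conf)

The standalone master and workers read their own settings from spark-env.sh,
so the equivalent knob on the master side would be SPARK_DAEMON_JAVA_OPTS
with -Dspark.worker.timeout=<seconds> (how long the master waits without a
heartbeat before dropping a worker); again, verify the exact property names
in the standalone docs for your release.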



On Tue, May 20, 2014 at 3:28 PM, Josh Marcus <jmarcus@meetup.com> wrote:

> We're using Spark 0.9.0 "out of the box" -- not using Cloudera Manager or
> anything similar.
>
> The master keeps warning that it is still receiving heartbeats from the
> unregistered workers. I will see if there are particular telltale errors
> on the worker side.
>
> We've had occasional problems with running out of memory on the driver
> side (esp. with large broadcast variables) so that may be related.
>
> --j
>
>
> On Tuesday, May 20, 2014, Matei Zaharia <matei.zaharia@gmail.com> wrote:
>
>> Are you guys both using Cloudera Manager? Maybe there’s also an issue
>> with the integration with that.
>>
>> Matei
>>
>> On May 20, 2014, at 11:44 AM, Aaron Davidson <ilikerps@gmail.com> wrote:
>>
>> I'd just like to point out that, along with Matei, I have not seen
>> workers drop even under the most exotic job failures. We're running pretty
>> close to master, though; perhaps it is related to an uncaught exception in
>> the Worker from a prior version of Spark.
>>
>>
>> On Tue, May 20, 2014 at 11:36 AM, Arun Ahuja <aahuja11@gmail.com> wrote:
>>
>>> Hi Matei,
>>>
>>> Unfortunately, I don't have more detailed information, but we have seen
>>> the loss of workers in standalone mode as well. If a job is killed through
>>> CTRL-C, we will often see the number of workers and cores decrease on the
>>> Spark master page. The workers are still alive and well in the Cloudera
>>> Manager page, but not visible to the Spark master. Simply restarting the
>>> workers usually resolves this, but we often see workers disappear after a
>>> failed or killed job.
>>>
>>> If we see this occur again, I'll try and provide some logs.
>>>
>>>
>>>
>>>
>>> On Mon, May 19, 2014 at 10:51 PM, Matei Zaharia <matei.zaharia@gmail.com
>>> > wrote:
>>>
>>>> Which version is this with? I haven’t seen standalone masters lose
>>>> workers. Is there other stuff on the machines that’s killing them, or what
>>>> errors do you see?
>>>>
>>>> Matei
>>>>
>>>> On May 16, 2014, at 9:53 AM, Josh Marcus <jmarcus@meetup.com> wrote:
>>>>
>>>> > Hey folks,
>>>> >
>>>> > I'm wondering what strategies other folks are using for maintaining
>>>> > and monitoring the stability of standalone Spark clusters.
>>>> >
>>>> > Our master very regularly loses workers, and they (as expected) never
>>>> > rejoin the cluster.  This is the same behavior I've seen using Akka
>>>> > cluster (if that's what Spark is using in standalone mode) -- are there
>>>> > configuration options we could be setting to make the cluster more
>>>> > robust?
>>>> >
>>>> > We have a custom script which monitors the number of workers (through
>>>> > the web interface) and restarts the cluster when necessary, as well as
>>>> > resolving other issues we face (like Spark shells left open permanently
>>>> > claiming resources), and it works, but it's nowhere close to a great
>>>> > solution.
>>>> >
>>>> > What are other folks doing?  Is this something that other folks
>>>> > observe as well?  I suspect that the loss of workers is tied to jobs
>>>> > that run out of memory on the client side or our use of very large
>>>> > broadcast variables, but I don't have an isolated test case.
>>>> > I'm open to general answers here: for example, perhaps we should simply
>>>> > be using Mesos or YARN instead of standalone mode.
>>>> >
>>>> > --j
>>>> >
>>>>
>>>>
>>>
>>
>>

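The custom watchdog script described in the original post isn't shown
anywhere in the thread. A minimal sketch of that idea, assuming the
standalone master's web UI exposes its state as JSON on port 8080 (the /json
path, the "state"/"ALIVE" fields, and the host name and worker count below
are assumptions to check against your deployment):

    import scala.io.Source
    import scala.sys.process._

    object WorkerWatchdog {
      // Hypothetical values -- substitute your own master host and expected worker count.
      val masterJsonUrl   = "http://spark-master.example.com:8080/json"
      val expectedWorkers = 8

      // Count workers the master currently reports as ALIVE. Crude string matching
      // stands in for real JSON parsing to keep the sketch dependency-free.
      def aliveWorkers(): Int = {
        val json = Source.fromURL(masterJsonUrl).mkString
        "\"state\"\\s*:\\s*\"ALIVE\"".r.findAllIn(json).length
      }

      def main(args: Array[String]): Unit = {
        val alive = aliveWorkers()
        if (alive < expectedWorkers) {
          println(s"Only $alive of $expectedWorkers workers registered; restarting workers")
          // A real script would more likely stop and restart only the affected workers via
          // the cluster's own start/stop scripts (and clean up orphaned shells as well).
          "sbin/start-slaves.sh".!
        }
      }
    }

Run from cron on the master host; the restart action and the JSON schema are
the parts most likely to need adjusting per release.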