spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: advice on maintaining a production spark cluster?
Date Tue, 20 May 2014 19:53:34 GMT
This isn't helpful of me to say, but I see the same sorts of problems
and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight
into what triggers it, but it usually happens after heavy use and after
the cluster has been running for a long time. I had figured I'd see
whether the changes since 0.9.0 address it and revisit later.

On Tue, May 20, 2014 at 8:37 PM, Josh Marcus <jmarcus@meetup.com> wrote:
> So, for example, I have two disassociated worker machines at the moment.
> The last messages in the spark logs are akka association error messages,
> like the following:
>
> 14/05/20 01:22:54 ERROR EndpointWriter: AssociationError
> [akka.tcp://sparkWorker@hdn3.int.meetup.com:50038] ->
> [akka.tcp://sparkExecutor@hdn3.int.meetup.com:46288]: Error [Association
> failed with [akka.tcp://sparkExecutor@hdn3.int.meetup.com:46288]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://sparkExecutor@hdn3.int.meetup.com:46288]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: hdn3.int.meetup.com/10.3.6.23:46288
> ]
>
> On the master side, there are lots and lots of messages of the form:
>
> 14/05/20 15:36:58 WARN Master: Got heartbeat from unregistered worker
> worker-20140520011737-hdn3.int.meetup.com-50038
>
> --j
>
>
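
For what it's worth, here is a minimal sketch for tallying which workers
the master considers unregistered. It assumes the master log lines look
exactly like the WARN line quoted above. The function name and the regex
are my own illustration, not anything shipped with Spark.

```python
import re
from collections import Counter

# Pattern assumed from the single master log line quoted in this thread;
# other Spark versions may word the warning differently.
UNREGISTERED = re.compile(
    r"WARN Master: Got heartbeat from unregistered worker\s+(\S+)"
)


def count_unregistered_heartbeats(log_text):
    """Return a Counter mapping worker ID to the number of warnings seen."""
    # \s+ also matches a newline, so a worker ID that was wrapped onto the
    # following line is still captured.
    return Counter(UNREGISTERED.findall(log_text))
```

Feeding it the contents of the master log should give a per-worker count,
which at least shows which workers have dropped out and how noisy each
one is.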
