spark-user mailing list archives

From Imran Rashid <im...@quantifind.com>
Subject Re: executor failures w/ scala 2.10
Date Fri, 01 Nov 2013 03:05:32 GMT
Unfortunately, that change wasn't the silver bullet I was hoping for.  Even with
1) ignoring DisassociatedEvent
2) having the executor use ReliableProxy to send messages back to the driver
3) turning up akka.remote.watch-failure-detector.threshold to 12


there is a lot of weird behavior.  First, there are a few
DisassociatedEvents, but some of them are followed by AssociatedEvents, so
that seems ok.  But sometimes the re-associations are immediately followed
by this:

13/10/31 18:51:10 INFO executor.StandaloneExecutorBackend: got lifecycleevent: AssociationError [akka.tcp://sparkExecutor@<executor>:41441] -> [akka.tcp://spark@<driver>:41321]: Error [Invalid address: akka.tcp://spark@<driver>:41321] [
akka.remote.InvalidAssociation: Invalid address: akka.tcp://spark@<driver>:41321
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.
]

On the driver, there are messages like:

[INFO] [10/31/2013 18:51:07.838] [spark-akka.actor.default-dispatcher-3] [Remoting] Address [akka.tcp://sparkExecutor@<executor>:46123] is now quarantined, all messages to this address will be delivered to dead letters.
[WARN] [10/31/2013 18:51:10.845] [spark-akka.actor.default-dispatcher-20] [akka://spark/system/remote-watcher] Detected unreachable: [akka.tcp://sparkExecutor@<executor>:41441]


and when the driver does decide that the executor has been terminated, it
removes the executor, but doesn't start another one.

there are a ton of messages also about messages to the block manager master
... I'm wondering if there are other parts of the system that need to use a
reliable proxy (or some sort of acknowledgement).

I really don't think this was working properly even w/ previous versions of
spark / akka.  I'm still learning about akka, but I think you always need
an ack to be confident w/ remote communication.  Perhaps the old version of
akka just had more robust defaults or something, but I bet it could still
have the same problems.  Even before, I have seen the driver thinking there
were running tasks, but nothing happening on any executor -- it was just
rare enough (and hard to reproduce) that I never bothered looking into it
more.
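
To make the ack idea concrete, here is a minimal sketch in plain Scala of "resend until acknowledged". It is not Spark's code and does not use Akka or ReliableProxy; every name in it (AckRetrySketch, Driver, LossyLink, sendReliably) is hypothetical, and the lossy link is simulated by dropping the first few sends. It also assumes a lost message is visible to the sender right away, where a real remote sender would need a timeout.

```scala
// Hypothetical sketch: none of these names come from Spark or Akka.
object AckRetrySketch {
  final case class TaskFinished(taskId: Long)

  // Stand-in for the driver: records each finished task once and
  // acknowledges every copy it receives.
  final class Driver {
    private val finished = scala.collection.mutable.LinkedHashSet.empty[Long]

    def receive(msg: TaskFinished): Boolean = {
      finished += msg.taskId // a set makes redelivery harmless
      true                   // the ack
    }

    def finishedTasks: List[Long] = finished.toList
  }

  // Stand-in for an unreliable link: silently drops the first `drops`
  // messages, then delivers the rest to the driver.
  final class LossyLink(driver: Driver, drops: Int) {
    private var sent = 0

    def send(msg: TaskFinished): Boolean = {
      sent += 1
      if (sent <= drops) false else driver.receive(msg)
    }
  }

  // Resend until the driver acks or the attempts run out.
  // Returns the number of attempts used, or None if no ack arrived.
  def sendReliably(link: LossyLink, msg: TaskFinished, maxAttempts: Int): Option[Int] = {
    @annotation.tailrec
    def loop(attempt: Int): Option[Int] =
      if (attempt > maxAttempts) None
      else if (link.send(msg)) Some(attempt)
      else loop(attempt + 1)

    loop(1)
  }

  def main(args: Array[String]): Unit = {
    val driver = new Driver
    val link = new LossyLink(driver, drops = 2)
    println(sendReliably(link, TaskFinished(7L), maxAttempts = 5))
    println(driver.finishedTasks)
  }
}
```

Without the retry loop, the first dropped send would leave the driver waiting on a task the executor considers finished, which is the hang described in this thread. The receiver de-duplicates by task id because a sender that retries on a lost ack can deliver the same message twice.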

I will keep digging ...

On Thu, Oct 31, 2013 at 4:36 PM, Matei Zaharia <matei.zaharia@gmail.com> wrote:

> BTW the problem might be the Akka failure detector settings that seem new
> in 2.2: http://doc.akka.io/docs/akka/2.2.3/scala/remoting.html
>
> Their timeouts seem pretty aggressive by default — around 10 seconds. This
> can easily be too little if you have large garbage collections. We should
> make sure they are higher than our own node failure detection timeouts.
>
> Matei
>
> On Oct 31, 2013, at 1:33 PM, Imran Rashid <imran@quantifind.com> wrote:
>
> pretty sure I found the problem -- two problems actually.  And I think one
> of them has been a general lurking problem w/ spark for a while.
>
> 1)  we should ignore disassociation events, as you suggested earlier.
> They seem to just indicate a temporary problem, and can generally be
> ignored.  I've found that they're regularly followed by AssociatedEvents,
> and it seems communication really works fine at that point.
>
> 2) Task finished messages get lost.  When this message gets sent, we don't
> know whether it actually gets there:
>
>
> https://github.com/apache/incubator-spark/blob/scala-2.10/core/src/main/scala/org/apache/spark/executor/StandaloneExecutorBackend.scala#L90
>
> (this is so incredible, I feel I must be overlooking something -- but
> there is no ack somewhere else that I'm overlooking, is there??)  So, after
> the patch, spark wasn't hanging b/c of the unhandled DisassociatedEvent.
> It hangs b/c the executor has sent some taskFinished messages that never
> get received by the driver.  So the driver is waiting for some tasks to
> finish, but the executors think they are all done.
>
> I'm gonna add the reliable proxy pattern for this particular interaction
> and see if it fixes the problem
>
> http://doc.akka.io/docs/akka/2.2.3/contrib/reliable-proxy.html#introducing-the-reliable-proxy
>
> imran
>
>
