spark-user mailing list archives

From Thomas Gerber <thomas.ger...@radius.com>
Subject Re: Driver disassociated
Date Thu, 05 Mar 2015 00:09:49 GMT
Also,

I was also seeing another problem that might be related:
"Error communicating with MapOutputTracker" (see the email on the mailing list today).

I just thought I would mention it in case it is relevant.
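
For reference, here is a minimal sketch of how the settings quoted below could be passed on the command line instead of through a properties file. The helper names (`implied_defaults`, `to_submit_args`) are made up for illustration, and the "defaults" are only what the quoted message implies (its values divided by 10), not values checked against the Spark source.

```python
# Settings copied from the quoted message. Per that message they are
# 10 times the defaults, so the implied defaults are these values / 10.
TUNED = {
    "spark.akka.timeout": 1000,
    "spark.akka.heartbeat.pauses": 60000,
    "spark.akka.failure-detector.threshold": 3000.0,
    "spark.akka.heartbeat.interval": 10000,
}


def implied_defaults(tuned, factor=10):
    """Return the defaults implied by 'tuned = factor * default'."""
    return {key: value / factor for key, value in tuned.items()}


def to_submit_args(settings):
    """Render settings as spark-submit '--conf key=value' arguments."""
    args = []
    for key, value in sorted(settings.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print the flags so they can be pasted after 'spark-submit'.
    print(" ".join(to_submit_args(TUNED)))
```

Printing the flags makes it easy to confirm that the job is really launched with the tuned values and not the defaults.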

On Wed, Mar 4, 2015 at 4:07 PM, Thomas Gerber <thomas.gerber@radius.com>
wrote:

> 1.2.1
>
> Also, I was using the following parameters, which are 10 times the default
> ones:
> spark.akka.timeout 1000
> spark.akka.heartbeat.pauses 60000
> spark.akka.failure-detector.threshold 3000.0
> spark.akka.heartbeat.interval 10000
>
> which should have helped *avoid* the problem if I understand correctly.
>
> Thanks,
> Thomas
>
> On Wed, Mar 4, 2015 at 3:21 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>
>> What release are you using?
>>
>> SPARK-3923 went into the 1.2.0 release.
>>
>> Cheers
>>
>> On Wed, Mar 4, 2015 at 1:39 PM, Thomas Gerber <thomas.gerber@radius.com>
>> wrote:
>>
>>> Hello,
>>>
>>> Sometimes, in the *middle* of a job, the job stops (its status then shows
>>> as FINISHED in the master).
>>>
>>> There isn't anything wrong in the shell/submit output.
>>>
>>> When looking at the executor logs, I see logs like this:
>>>
>>> 15/03/04 21:24:51 INFO MapOutputTrackerWorker: Doing the fetch; tracker
>>> actor = Actor[akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019/user/MapOutputTracker#893807065]
>>> 15/03/04 21:24:51 INFO MapOutputTrackerWorker: Don't have map outputs
>>> for shuffle 38, fetching them
>>> 15/03/04 21:24:55 ERROR CoarseGrainedExecutorBackend: Driver
>>> Disassociated [akka.tcp://sparkExecutor@ip-10-0-11-9.ec2.internal:54766]
>>> -> [akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019]
>>> disassociated! Shutting down.
>>> 15/03/04 21:24:55 WARN ReliableDeliverySupervisor: Association with
>>> remote system [akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019]
>>> has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>>>
>>> How can I investigate further?
>>> Thanks
>>>
>>
>>
>
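
One possible way to investigate further, as asked in the original message, is to pull out the lines that precede each "Driver Disassociated" error in the executor logs, to see what the executor was doing just before the driver went away. This is only a sketch: the function names and the idea of scanning a log file by path are assumptions, and the marker string is simply taken from the log excerpt quoted above.

```python
from collections import deque

# Marker taken from the ERROR line in the quoted executor log.
MARKER = "Driver Disassociated"


def context_before_disassociation(lines, marker=MARKER, context=5):
    """Yield (preceding_lines, matching_line) for each line containing marker.

    'lines' can be any iterable of strings, such as an open log file.
    At most 'context' preceding lines are kept for each match.
    """
    recent = deque(maxlen=context)
    for line in lines:
        line = line.rstrip("\n")
        if marker in line:
            yield list(recent), line
        recent.append(line)


def scan_file(path, context=5):
    """Print each match in the file at 'path' with its preceding lines."""
    with open(path, errors="replace") as handle:
        for before, hit in context_before_disassociation(handle, context=context):
            for earlier in before:
                print("   " + earlier)
            print(">> " + hit)
```

Calling `scan_file` on each executor's log file (wherever the deployment writes them) and comparing the timestamps against the driver log for the same moment may show whether the driver had stalled or exited before the executors reported the disassociation.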
