spark-user mailing list archives

From Andrew Or <and...@databricks.com>
Subject Re: weird YARN errors on new Spark on Yarn cluster
Date Thu, 02 Oct 2014 17:37:59 GMT
Ah I see, you were running in yarn cluster mode so the logs are the same.
Glad you figured it out.
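
For anyone who hits the same symptom, here is a minimal sketch of the kind of spark-defaults.conf entry Greg describes below; only the property names come from this thread, the path and assembly version are hypothetical:

  # Stale Spark install on the datanodes, pulled onto the executor classpath.
  # Removing this line lets the executors use the assembly shipped via HDFS instead.
  spark.executor.extraClassPath   /opt/spark-old/lib/*

  # (Spark 1.x) where YARN should fetch the Spark assembly from:
  spark.yarn.jar   hdfs:///apps/spark/spark-assembly-1.1.0-hadoop2.4.0.jar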

2014-10-02 10:31 GMT-07:00 Greg Hill <greg.hill@rackspace.com>:

>  So, I actually figured it out, and it's all my fault.  I had an older
> version of spark on the datanodes and was passing
> in spark.executor.extraClassPath to pick it up.  It was a holdover from
> some initial work before I got everything working right.  Once I removed
> that, it picked up the spark JAR from hdfs instead and ran without issue.
>
>  Sorry for the false alarm.
>
>  The AM container logs were what I had pasted in the original email, btw.
>
>  Greg
>
>   From: Andrew Or <andrew@databricks.com>
> Date: Thursday, October 2, 2014 12:24 PM
> To: Greg <greg.hill@rackspace.com>
> Cc: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: weird YARN errors on new Spark on Yarn cluster
>
>   Hi Greg,
>
>  Have you looked at the AM container logs? (You may already know this,
> but) you can get these through the RM web UI or through:
>
>  yarn logs -applicationId <your app ID>
>
>  If an AM throws an exception then the executors may not be started
> properly.
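>
>  In case it helps, a hypothetical end-to-end example (the application ID below is made up):
>
>  # list applications to find the ID (add -appStates ALL to include finished ones)
>  yarn application -list -appStates ALL
>
>  # fetch all container logs, AM included, once log aggregation has collected them
>  yarn logs -applicationId application_1412263693856_0007 > app.log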
>
>  -Andrew
>
>
>
> 2014-10-02 9:47 GMT-07:00 Greg Hill <greg.hill@rackspace.com>:
>
>>  I haven't run into this until today.  I spun up a fresh cluster to do
>> some more testing, and it seems that every single executor fails because it
>> can't connect to the driver.  This is in the YARN logs:
>>
>>  14/10/02 16:24:11 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://sparkDriver@GATEWAY-1:60855/user/CoarseGrainedScheduler
>>  14/10/02 16:24:11 ERROR executor.CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@DATANODE-3:58232] -> [akka.tcp://sparkDriver@GATEWAY-1:60855] disassociated! Shutting down.
>>
>>  And this is what shows up from the driver:
>>
>>  14/10/02 16:43:06 INFO cluster.YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@DATANODE-1:60341/user/Executor#1289950113] with ID 2
>>  14/10/02 16:43:06 INFO util.RackResolver: Resolved DATANODE-1 to /rack/node8da83a04def73517bf437e95aeefa2469b1daf14
>>  14/10/02 16:43:06 INFO cluster.YarnClientSchedulerBackend: Executor 2 disconnected, so removing it
>>
>> It doesn't appear to be a networking issue.  Networking works both
>> directions and there's no firewall blocking ports.  Googling the issue, it
>> sounds like the most common problem is overallocation of memory, but I'm
>> not doing that.  I've got these settings for a 3 * 128GB node cluster:
>>
>>  spark.executor.instances            17
>>  spark.executor.memory               12424m
>> spark.yarn.executor.memoryOverhead  3549
>>
>>  That makes it 6 * 15973 = 95838 MB per node, which is well beneath the
>> 128GB limit.
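>>
>>  (A quick sanity check on the arithmetic: each container request is spark.executor.memory + spark.yarn.executor.memoryOverhead = 12424 + 3549 = 15973 MB, which YARN compares against its own limits in yarn-site.xml. The values below are hypothetical, just to show which properties are involved:
>>
>>  yarn.nodemanager.resource.memory-mb    116736
>>  yarn.scheduler.maximum-allocation-mb   16384
>>
>>  A single request above yarn.scheduler.maximum-allocation-mb is rejected at submission time, while a container that exceeds its allocation at runtime gets killed by the NodeManager.)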
>>
>>  Frankly I'm stumped.  It worked fine when I spun up a cluster last
>> week, but now it doesn't.  The logs give me no indication as to what the
>> problem actually is.  Any pointers to where else I might look?
>>
>>  Thanks in advance.
>>
>>  Greg
>>
>
>
