spark-user mailing list archives

From Shay Seng <s...@urbanengines.com>
Subject Re: Debugging cluster stability, configuration issues
Date Thu, 21 Aug 2014 20:33:56 GMT
Unfortunately it doesn't look like my executors are OOMing. On the slave
machines I checked both the logs in /spark/log (which I assume are from the
slave daemon?) and those in /spark/work/..., which I assume are from each
worker/executor.

On Thu, Aug 21, 2014 at 11:19 AM, Yana Kadiyska <yana.kadiyska@gmail.com>
wrote:

> Whenever I've seen this exception it has ALWAYS been the case of an
> executor running out of memory. I don't use checkpointing, so I'm not too
> sure about the first item. The rest of them I believe would happen if an
> executor fails and the worker spawns a new executor. Usually a good way to
> verify this is to look in the driver log where it says "Lost TID
> 102135" and see which worker TID 102135 was sent to. If I'm correct
> and an executor has rolled, you would see two executor logs for your
> application -- the first one usually contains an OOM. I run 0.9.1 but I
> believe it should be a pretty similar setup.
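
To make that log-correlation step concrete, here is a minimal Scala sketch
(not from the original thread) that scans a driver log for task-loss warnings
so the TIDs can be matched against the per-application executor logs under
/spark/work on the corresponding worker. The default log path is only an
assumption; it will differ per deployment.

import scala.io.Source

// Minimal sketch: print the "Lost TID" / "Loss was due to" warnings from a
// driver log so the affected tasks can be traced to a worker's executor logs.
// The default path below is an assumption; pass your driver log as argument 0.
object FindLostTasks {
  def main(args: Array[String]): Unit = {
    val driverLog = if (args.nonEmpty) args(0) else "/spark/logs/driver.log"
    Source.fromFile(driverLog)
      .getLines()
      .filter(l => l.contains("Lost TID") || l.contains("Loss was due to"))
      .foreach(println)
  }
}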
>
>
> On Thu, Aug 21, 2014 at 1:23 PM, Shay Seng <shay@urbanengines.com> wrote:
>
>> Hi,
>>
>> I am running Spark 0.9.2 on an EC2 cluster with about 16 r3.4xlarge
>> machines. The cluster is running Spark standalone and was launched with the
>> ec2 scripts. In my Spark job, I am using ephemeral HDFS to checkpoint some
>> of my RDDs. I'm also reading from and writing to S3. My jobs also involve a
>> large amount of shuffling.
>>
>> I run the same job on multiple sets of data, and for 50-70% of these runs
>> the job completes with no issues. (Typically a rerun will allow the
>> "failures" to complete as well.)
>>
>> However, on the remaining runs I see a bunch of different kinds of
>> issues pop up (which go away if I rerun the same job).
>>
>> (1) Checkpointing silently fails (I assume). The checkpoint dir exists in
>> HDFS, but no data files are written out, and a later step in the job that
>> tries to reload these RDDs fails because it cannot read from HDFS.
>> Usually a stop-dfs/start-dfs "cures" this.
>> *Q: What could be the cause of this? Timeouts?*
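
One hypothesis worth ruling out (an assumption on my part, not something
confirmed in this thread): checkpoint data is only written when an action
runs on the RDD after checkpoint() has been called, so a job that marks an
RDD but never materializes it afterwards leaves an empty checkpoint
directory, which can look like a silent failure. A minimal Scala sketch of
the usual 0.9.x pattern, with the HDFS URL as a placeholder:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal checkpointing sketch for Spark 0.9.x. The HDFS URL is a
// placeholder for the ephemeral-HDFS namenode on the EC2 cluster.
object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))
    sc.setCheckpointDir("hdfs://<namenode-host>:9000/checkpoints")

    val rdd = sc.parallelize(1 to 1000).map(_ * 2)
    rdd.checkpoint()   // only marks the RDD; nothing is written yet
    rdd.count()        // checkpoint files are written once an action runs

    sc.stop()
  }
}

If the directory stays empty even though actions do run afterwards, that
points more toward an HDFS-side problem, which would also fit the
stop-dfs/start-dfs workaround described above.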
>>
>>
>> (2) Other times I get the following -- no idea who or what is causing it.
>> In the master's /spark/logs:
>> 2014-08-21 16:46:15 ERROR EndpointWriter: AssociationError [akka.tcp://
>> sparkMaster@ec2-54-218-216-19.us-west-2.compute.amazonaws.com:7077] ->
>> [akka.tcp://spark@ip-10-34-2-246.us-west-2.compute.internal:37681]:
>> Error [Association failed with [akka.tcp://spark@ip-10
>> -34-2-246.us-west-2.compute.internal:37681]] [
>> akka.remote.EndpointAssociationException: Association failed with
>> [akka.tcp://spark@ip-10-34-2-246.us-west-2.compute.internal:37681]
>> Caused by:
>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> Connection refused: ip-10-34-2-246.us-west-2.compute.internal/
>> 10.34.2.246:37681
>> ]
>>
>> Slave Log:
>> 2014-08-21 16:46:47 INFO ConnectionManager: Removing SendingConnection to
>> ConnectionManagerId(ip-10-33-7-4.us-west-2.compute.internal,33242)
>> 2014-08-21 16:46:47 ERROR SendingConnection: Exception while reading
>> SendingConnection to
>> ConnectionManagerId(ip-10-33-7-4.us-west-2.compute.internal,33242)
>> java.nio.channels.ClosedChannelException
>>         at
>> sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
>>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
>>         at
>> org.apache.spark.network.SendingConnection.read(Connection.scala:398)
>>         at
>> org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:158)
>>         at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>         at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>         at java.lang.Thread.run(Thread.java:744)
>> *Q: Where do I even start debugging this kind of issue? Are the machines
>> too loaded, so timeouts are getting hit? Am I not setting some
>> configuration value correctly? I would be grateful for some hints on where
>> to start looking!*
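
For what it's worth, when heavily loaded nodes start refusing connections
like this, the knobs people usually look at first are the Akka and
block-manager timeouts. Below is a sketch of where they would be set when
the SparkContext is created; the property names should be double-checked
against the 0.9.2 configuration docs, and the values are purely illustrative.

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: raise timeout-related settings that tend to fire when
// nodes are heavily loaded. Verify each property name against the Spark
// 0.9.2 configuration documentation; the values are not recommendations.
val conf = new SparkConf()
  .setAppName("my-job")
  .set("spark.akka.timeout", "300")                          // seconds
  .set("spark.akka.frameSize", "50")                         // MB
  .set("spark.storage.blockManagerSlaveTimeoutMs", "120000")
val sc = new SparkContext(conf)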
>>
>>
>> (3) Often (2) will be preceded by the following in /spark/logs:
>> 2014-08-21 16:34:10 WARN TaskSetManager: Lost TID 102135 (task 398.0:147)
>> 2014-08-21 16:34:10 WARN TaskSetManager: Loss was due to fetch failure
>> from BlockManagerId(0, ip-10-33-131-250.us-west-2.compute.internal, 51371,
>> 0)
>> 2014-08-21 16:34:10 WARN TaskSetManager: Loss was due to fetch failure
>> from BlockManagerId(0, ip-10-33-131-250.us-west-2.compute.internal, 51371,
>> 0)
>> 2014-08-21 16:34:10 WARN TaskSetManager: Loss was due to fetch failure
>> from BlockManagerId(0, ip-10-33-131-250.us-west-2.compute.internal, 51371,
>> 0)
>> Not sure if this is an indication...
>>
>>
>>
>> I'll be very grateful for any ideas on how to start debugging these.
>> Is there anything I should be monitoring -- CPU usage on master/slaves,
>> number of executors/CPUs, Akka threads, etc.?
>>
>> Cheers,
>> shay
>>
>
>
