spark-user mailing list archives

From Aniket Bhatnagar <aniket.bhatna...@gmail.com>
Subject Re: OS killing Executor due to high (possibly off heap) memory usage
Date Fri, 25 Nov 2016 17:14:38 GMT
Thanks Rohit, Rodrick and Shreya. I tried
changing spark.yarn.executor.memoryOverhead to 10 GB and lowering the
executor memory to 30 GB, but neither of these worked on its own. I finally
had to reduce the number of cores per executor from 36 to 18, in addition
to setting a higher spark.yarn.executor.memoryOverhead and a lower executor
memory size. In effect, I traded performance for reliability.
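
For anyone hitting the same wall, the combination that finally worked looks
roughly like this as a SparkConf sketch in Scala (values are the ones
described above; adjust for your own cluster):

  import org.apache.spark.SparkConf

  // Sketch of the settings that stabilised the job for me. Note that
  // spark.yarn.executor.memoryOverhead is interpreted in MB.
  val conf = new SparkConf()
    .set("spark.executor.memory", "30g")                // down from 47127M
    .set("spark.yarn.executor.memoryOverhead", "10240") // ~10 GB off-heap headroom
    .set("spark.executor.cores", "18")                  // down from 36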

Unfortunately, Spark does a poor job of reporting off-heap memory usage.
From the profiler, it seems that the job's heap usage is fairly static, but
the off-heap memory fluctuates quite a lot. It looks like the bulk of the
off-heap memory is held by io.netty.buffer.UnpooledUnsafeDirectByteBuf
instances while the shuffle client reads blocks from the shuffle service.
It appears that org.apache.spark.network.util.TransportFrameDecoder retains
them in its buffers field while decoding responses from the shuffle
service. So far, it's not clear why it needs to hold multiple GBs in those
buffers. Perhaps increasing the number of partitions would help.
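
If anyone wants to try the partitioning angle, a minimal sketch (leftRdd
and rightRdd are placeholders for the pair RDDs being joined; "spark" is
assumed to be the active SparkSession):

  // More partitions mean smaller shuffle blocks, so each response the
  // frame decoder has to buffer should be smaller. 8192 is just an example.
  spark.conf.set("spark.sql.shuffle.partitions", "8192") // DataFrame/SQL joins
  // For RDD joins, pass the partition count explicitly:
  val joined = leftRdd.join(rightRdd, 8192)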

Thanks,
Aniket

On Fri, Nov 25, 2016 at 1:09 AM Shreya Agarwal <shreyagr@microsoft.com>
wrote:

I don't think it's just memory overhead. It might be better to use an
executor with less heap space (30 GB?). 46 GB would mean more data loaded
into memory and more GC, which can cause issues.

Also, have you persisted (cached) data in any way? If so, that might be
contributing to the problem.
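
A quick way to check is a sketch like this (getPersistentRDDs is the public
API for it; "spark" is assumed to be the active SparkSession):

  // Lists the RDDs the application currently has persisted, with their
  // storage levels.
  spark.sparkContext.getPersistentRDDs.foreach { case (id, rdd) =>
    println(s"RDD $id: ${rdd.getStorageLevel}")
  }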

Lastly, I am not sure whether your data is skewed; skew could force a lot
of data onto a single executor node.
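
To check for skew, something along these lines should work as a sketch
("df" and "key" are placeholders for the join input and join column):

  import org.apache.spark.sql.functions.desc

  // Row counts for the heaviest join keys; a handful of very large keys
  // would pin most of the shuffle data on a few executors.
  df.groupBy("key").count().orderBy(desc("count")).show(20)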

From: Rodrick Brown <rodrick@orchardplatform.com>
Sent: Friday, November 25, 2016 12:25 AM
To: Aniket Bhatnagar <aniket.bhatnagar@gmail.com>
Cc: user <user@spark.apache.org>
Subject: Re: OS killing Executor due to high (possibly off heap) memory
usage


Try setting spark.yarn.executor.memoryOverhead=10000 (the value is in MB,
i.e. roughly 10 GB).

On Thu, Nov 24, 2016 at 11:16 AM, Aniket Bhatnagar
<aniket.bhatnagar@gmail.com> wrote:

Hi Spark users

I am running a job that does a join over a huge dataset (7 TB+), and the
executors keep crashing randomly, eventually causing the job to fail. There
are no out-of-memory exceptions in the logs and, looking at the dmesg
output, it seems the OS killed the JVM because of high memory usage. My
suspicion is that off-heap usage by the executor is the cause, as I am
limiting the executor's on-heap usage to 46 GB and each host running an
executor has 60 GB of RAM. After an executor crashes, I can see that the
external shuffle service
(org.apache.spark.network.server.TransportRequestHandler) logs a lot of
closed-channel exceptions in the YARN node manager logs. This leads me to
believe that something runs out of memory during shuffle reads. Is there a
configuration to completely disable the use of off-heap memory? I have
tried setting spark.shuffle.io.preferDirectBufs=false, but the executor
still gets killed the same way.

Cluster details:
10 AWS c4.8xlarge hosts
RAM on each host - 60 GB
Number of cores on each host - 36
Additional hard disk on each host - 8 TB

Spark configuration:
dynamic allocation enabled
external shuffle service enabled
spark.driver.memory 1024M
spark.executor.memory 47127M
Spark master yarn-cluster
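
For completeness, the same configuration expressed as a SparkConf sketch
(property names are the standard ones for the settings listed above):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true")
    .set("spark.driver.memory", "1024m")
    .set("spark.executor.memory", "47127m")
    .set("spark.shuffle.io.preferDirectBufs", "false") // tried; didn't help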

Sample error in the YARN node manager logs:
2016-11-24 10:34:06,507 ERROR
org.apache.spark.network.server.TransportRequestHandler
(shuffle-server-50): Error sending result
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=919299554123,
chunkIndex=0},
buffer=FileSegmentManagedBuffer{file=/mnt3/yarn/usercache/hadoop/appcache/application_1479898345621_0006/blockmgr-ad5301a9-e1e9-4723-a8c4-9276971b2259/2c/shuffle_3_963_0.data,
offset=0, length=669014456}} to /10.192.108.170:52782; closing connection
java.nio.channels.ClosedChannelException

Error in dmesg:
[799873.309897] Out of memory: Kill process 50001 (java) score 927 or
sacrifice child
[799873.314439] Killed process 50001 (java) total-vm:65652448kB,
anon-rss:57246528kB, file-rss:0kB

Thanks,
Aniket

--
Rodrick Brown / DevOps
9174456839 / rodrick@orchardplatform.com
Orchard Platform
101 5th Avenue, 4th Floor, New York, NY

