We have an application that reads text files, converts them to DataFrames, and saves them in Parquet format. The application runs fine when processing a few files, but several thousand files are produced every day. When we run the job over all of them, spark-submit is killed with an out-of-memory error:

# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 27226"...

The job is written in Python. We’re running it on Amazon EMR 5.0 (Spark 2.0.0) with spark-submit. The cluster has one c3.2xlarge master instance (8 cores and 15 GB of RAM) and three c3.4xlarge core instances (16 cores and 30 GB of RAM each). The Spark config settings are as follows:

('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
('spark.executors.instances', '3'),
('spark.yarn.executor.memoryOverhead', '9g'),
('spark.executor.cores', '15'),
('spark.executor.memory', '12g'),
('spark.scheduler.mode', 'FIFO'),
('spark.cleaner.ttl', '1800'),
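
For reference, here is a rough sizing check of what these settings ask for on each core node. It is only a sketch, and it assumes YARN sizes each executor container as executor memory plus the memory overhead; the variable names are ours, not Spark's.

```python
# Rough sizing check for the executor settings above (values in GB).
# Assumption: the YARN container request per executor is
# spark.executor.memory + spark.yarn.executor.memoryOverhead.
executor_memory_gb = 12   # spark.executor.memory
memory_overhead_gb = 9    # spark.yarn.executor.memoryOverhead
node_ram_gb = 30          # RAM on each c3.4xlarge core node

container_gb = executor_memory_gb + memory_overhead_gb
headroom_gb = node_ram_gb - container_gb

print(f"container request per executor: {container_gb} GB")
print(f"RAM left on each core node: {headroom_gb} GB")
```

This only covers the executors; none of the settings shown in the list sizes the driver's heap.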

The job processes each file in its own thread, with 10 threads running concurrently. The process hits the OOM after about 4 hours, by which point Spark has run over 20,000 jobs.
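
To make the threading setup concrete, here is a minimal sketch of the driver loop, not our exact code. The names `convert_file` and `run_all` and the (source, destination) path pairs are hypothetical, and the real parsing logic is left out; it assumes a single shared SparkSession passed in as `spark`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def convert_file(spark, src_path, dst_path):
    # Hypothetical per-file step: read one text file and save it as Parquet.
    # The actual job parses the text into columns first; that part is omitted.
    df = spark.read.text(src_path)
    df.write.mode("overwrite").parquet(dst_path)
    return dst_path


def run_all(spark, jobs, convert=convert_file, max_workers=10):
    """Run convert(spark, src, dst) for every (src, dst) pair on a thread pool."""
    done, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its source path so failures can be reported.
        futures = {pool.submit(convert, spark, src, dst): src for src, dst in jobs}
        for fut in as_completed(futures):
            try:
                done.append(fut.result())
            except Exception as exc:
                failed.append((futures[fut], exc))
    return done, failed
```

All threads share the one SparkSession in the driver process, so every file's Spark jobs are submitted from, and tracked by, the same driver JVM.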

It seems like the driver is running out of memory, even though each individual job is quite small. Are there any known memory leaks for long-running Spark applications on YARN?