We have an application that reads text files, converts them to DataFrames, and saves them in Parquet format. The application runs fine when processing a few files, but several thousand files are produced every day, and when we run the job over all of them, spark-submit is killed with an out-of-memory error:
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 27226"...
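For context, each file goes through essentially the following conversion (a simplified sketch; the separator, header option, and paths are placeholders rather than our real parsing logic):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def convert_file(input_path, output_path):
    # Read one delimited text file into a DataFrame, then write it out as Parquet.
    df = spark.read.csv(input_path, sep='\t', header=True)
    df.write.parquet(output_path, mode='overwrite')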
The job is written in Python. We run it with spark-submit on Amazon EMR 5.0 (Spark 2.0.0), on a cluster with one c3.2xlarge master instance (8 cores, 15 GB of RAM) and three c3.4xlarge core instances (16 cores, 30 GB of RAM each). The Spark configuration settings are as follows:
('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
('spark.executors.instances', '3'),
('spark.yarn.executor.memoryOverhead', '9g'),
('spark.executor.cores', '15'),
('spark.executor.memory', '12g'),
('spark.scheduler.mode', 'FIFO'),
('spark.cleaner.ttl', '1800'),
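These pairs are applied when the SparkSession is built; a minimal sketch of that step (the SparkConf.setAll call and the variable names are illustrative, not our exact startup code):

from pyspark import SparkConf
from pyspark.sql import SparkSession

settings = [
    ('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
    # ...plus the remaining (key, value) pairs listed above...
]

conf = SparkConf().setAll(settings)
spark = SparkSession.builder.config(conf=conf).getOrCreate()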
The job processes each file in its own thread, with 10 threads running concurrently. The process OOMs after about 4 hours, by which point Spark has completed more than 20,000 jobs.
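The driver-side concurrency looks roughly like this (a simplified sketch; ThreadPoolExecutor and the helper names stand in for our actual threading code, and convert_file is the per-file step sketched above):

from concurrent.futures import ThreadPoolExecutor

def process_all(file_pairs):
    # file_pairs: an iterable of (input_path, output_path) tuples, one per text file.
    # Each worker thread drives its own sequence of Spark jobs through the shared session.
    with ThreadPoolExecutor(max_workers=10) as pool:
        futures = [pool.submit(convert_file, src, dst) for src, dst in file_pairs]
        for future in futures:
            future.result()  # re-raise any per-file failure on the driver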