We have an application that reads text files, converts them to dataframes, and saves them in Parquet format. The application runs fine when processing a few files, but several thousand files are produced every day. When we run the job over all of them, spark-submit is killed with an OOM error:
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 27226"...
The job is written in Python and runs on Amazon EMR 5.0 (Spark 2.0.0) via spark-submit. The cluster has one c3.2xlarge master instance (8 cores, 15 GB of RAM) and 3 c3.4xlarge core instances (16 cores, 30 GB of RAM each). The Spark config settings are as follows:
The job processes each file in its own thread, with 10 threads running concurrently. The process OOMs after about 4 hours, by which point Spark has completed over 20,000 jobs.
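The threading pattern looks roughly like this (a minimal sketch, with illustrative names; `process_file` stands in for our actual read-text → DataFrame → write-Parquet step, shown only as comments):

```python
from concurrent.futures import ThreadPoolExecutor

def process_file(path):
    # Placeholder for the real per-file work against a shared SparkSession:
    #   df = spark.read.text(path)
    #   ... transformations ...
    #   df.write.parquet(output_path_for(path))
    return path  # stand-in result

# Illustrative input list; the real job sees several thousand paths per day.
paths = [f"file_{i}.txt" for i in range(100)]

# Cap concurrency at 10 worker threads, as in the real job.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(process_file, paths))

print(len(results))  # → 100
```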