spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Igor Pernek <>
Subject Spark memory optimization
Date Fri, 04 Jul 2014 09:06:43 GMT
Hi all!

I have a folder with 150 G of txt files (around 700 files, on average each
200 MB).

I'm using scala to process the files and calculate some aggregate
statistics in the end. I see two possible approaches to do that: - manually
loop through all the files, do the calculations per file and merge the
results in the end - read the whole folder to one RDD, do all the
operations on this single RDD and let spark do all the parallelization

I'm leaning towards the second approach as it seems cleaner (no need for
parallelization specific code), but I'm wondering if my scenario will fit
the constraints imposed by my hardware and data. I have one workstation
with 16 threads and 64 GB of RAM available (so the parallelization will be
strictly local between different processor cores). I might scale the
infrastructure with more machines later on, but for now I would just like
to focus on tunning the settings for this one workstation scenario.

The code I'm using: - reads TSV files, and extracts meaningful data to
(String, String, String) triplets - afterwards some filtering, mapping and
grouping is performed - finally, the data is reduced and some aggregates
are calculated

I've been able to run this code with a single file (~200 MB of data),
however I get a java.lang.OutOfMemoryError: GC overhead limit exceeded
and/or a Java out of heap exception when adding more data (the application
breaks with 6GB of data but I would like to use it with 150 GB of data).

I guess I would have to tune some parameters to make this work. I would
appreciate any tips on how to approach this problem (how to debug for
memory demands). I've tried increasing the 'spark.executor.memory' and
using a smaller number of cores (the rational being that each core needs
some heap space), but this didn't solve my problems.

I don't need the solution to be very fast (it can easily run for a few
hours even days if needed). I'm also not caching any data, but just saving
them to the file system in the end. If you think it would be more feasible
to just go with the manual parallelization approach, I could do that as



View raw message