spark-user mailing list archives

From 孫澤恩 <>
Subject Re: partitionBy causing OOM
Date Tue, 26 Sep 2017 02:30:30 GMT
Hi, Amit,

Maybe you can change the configuration spark.sql.shuffle.partitions.
The default is 200; changing this property changes the number of shuffle tasks when you are using DataFrames.
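For example, the setting can be passed at submit time (a sketch; the script name my_etl_job.py is hypothetical, and 400 is an arbitrary value just to illustrate raising the shuffle partition count so each task handles a smaller slice of the exploded data):

```shell
# Raise the shuffle partition count from the default 200.
# my_etl_job.py stands in for your pyspark application.
spark-submit \
  --conf spark.sql.shuffle.partitions=400 \
  my_etl_job.py
```

The same key can also be set in code via SparkSession.builder.config(...) or at runtime with spark.conf.set(...) before the shuffle happens.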

> On 26 Sep 2017, at 1:25 AM, Amit Sela <> wrote:
> I'm trying to run a simple pyspark application that reads from file (json), flattens
> it (explode) and writes back to file (json) partitioned by date using DataFrameWriter.partitionBy(*cols).
> I keep getting OOMEs like:
> java.lang.OutOfMemoryError: Java heap space
> 	at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.<init>(
> 	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(
> 	at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(
> .......
> Explode could make the underlying RDD grow a lot, and maybe in an unbalanced way sometimes;
> adding to that, partitioning by date (in daily ETLs for instance) would probably cause
> a data skew (right?), but why am I getting OOMs? Isn't Spark supposed to spill to disk if
> the underlying RDD is too big to fit in memory?
> If I'm not using "partitionBy" with the writer (still exploding) everything works fine.
> This happens both in EMR and in local (mac) pyspark/spark shell (tried both in python
> and scala).
> Thanks!
