spark-user mailing list archives

From 孫澤恩 <gn00710...@gmail.com>
Subject Re: partitionBy causing OOM
Date Tue, 26 Sep 2017 02:30:30 GMT
Hi, Amit,

Maybe you can change the configuration spark.sql.shuffle.partitions.
The default is 200; changing this property changes the number of tasks (shuffle partitions)
used when you are working with the DataFrame API.
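
A minimal sketch of where to set it in PySpark (the value 1000 below is only a placeholder; tune it for your data volume):

    from pyspark.sql import SparkSession

    # Option 1: set it when building the session (value is only an example)
    spark = (SparkSession.builder
             .appName("explode-and-partition")
             .config("spark.sql.shuffle.partitions", "1000")
             .getOrCreate())

    # Option 2: change it on an existing session, before the shuffle-heavy stage runs
    spark.conf.set("spark.sql.shuffle.partitions", "1000")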

> On 26 Sep 2017, at 1:25 AM, Amit Sela <amit.sela@venmo.com> wrote:
> 
> I'm trying to run a simple pyspark application that reads from file (json), flattens
> it (explode) and writes back to file (json) partitioned by date using DataFrameWriter.partitionBy(*cols).
> 
> I keep getting OOMEs like:
> java.lang.OutOfMemoryError: Java heap space
> 	at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.<init>(UnsafeSorterSpillWriter.java:46)
> 	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206)
> 	at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
> .......
> 
> Explode could make the underlying RDD grow a lot, and maybe in an unbalanced way sometimes;
> adding to that, partitioning by date (in daily ETLs, for instance) would probably cause
> data skew (right?). But why am I getting OOMs? Isn't Spark supposed to spill to disk if
> the underlying RDD is too big to fit in memory?
> 
> If I'm not using "partitionBy" with the writer (still exploding) everything works fine.
> 
> This happens both in EMR and in local (mac) pyspark/spark shell (tried both in python
> and scala).
> 
> Thanks!
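
For context, a rough sketch of the job shape described above (the paths, the nested array
column "records", and the top-level "date" column are placeholders, not taken from the
original job):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode

    spark = SparkSession.builder.appName("flatten-json").getOrCreate()

    # read json, flatten the nested array, write back partitioned by date
    df = spark.read.json("s3://bucket/input/")                    # placeholder path
    flat = df.select("date", explode("records").alias("record"))  # hypothetical columns
    (flat.write
         .partitionBy("date")                                     # the step that triggers the OOM
         .json("s3://bucket/output/"))                            # placeholder path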

