spark-user mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: pyspark join crash
Date Wed, 04 Jun 2014 20:42:27 GMT
In PySpark, the data processed by each reduce task needs to fit in memory within the Python
process, so you should use more tasks to process this dataset. Data is spilled to disk across
tasks.
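
For example, a rough sketch of the kind of change this implies (the RDD names and sizes are hypothetical, not taken from Brad's job, and an existing SparkContext sc is assumed): passing numPartitions to join() raises the number of reduce tasks, so each Python worker handles a smaller slice of the shuffled data.

    # Hypothetical pair RDDs standing in for the real data set.
    left = sc.parallelize([(i % 10000, i) for i in range(1000000)])
    right = sc.parallelize([(i % 10000, -i) for i in range(1000000)])

    # With too few reduce tasks, one task's share of the shuffle output may not
    # fit in a single Python worker's memory; requesting more partitions for the
    # result shrinks each task's share.
    joined = left.join(right, numPartitions=2000)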

I’ve created https://issues.apache.org/jira/browse/SPARK-2021 to track this — it’s something
we’ve been meaning to look at soon.

Matei

On Jun 4, 2014, at 8:23 AM, Brad Miller <bmiller1@eecs.berkeley.edu> wrote:

> Hi All,
> 
> I have experienced some crashing behavior with join in pyspark.  When I attempt a join
> with 2000 partitions in the result, the join succeeds, but when I use only 200 partitions
> in the result, the join fails with the message "Job aborted due to stage failure: Master
> removed our application: FAILED".
> 
> The crash always occurs at the beginning of the shuffle phase.  Based on my observations,
> it seems like the workers in the read phase may be fetching entire blocks from the write
> phase of the shuffle rather than just the records necessary to compose the partition the
> reader is responsible for.  Hence, when there are fewer partitions in the read phase, the
> worker is likely to need a record from each of the write partitions and consequently
> attempts to load the entire data set into the memory of a single machine (which then
> causes the out of memory crash I observe in /var/log/syslog).
> 
> Can anybody confirm if this is the behavior of pyspark?  I am glad to supply additional
> details about my observed behavior upon request.
> 
> best,
> -Brad
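
A quick, hedged way to check whether individual reduce partitions really are too large for one Python worker (this is not from the original thread; "joined" below stands in for the result of the problematic join) is to count records per partition without collecting the data itself:

    # A handful of very large per-partition counts points at the partitioning,
    # rather than the total data size, as the cause of the out-of-memory crash.
    counts = joined.mapPartitions(lambda part: [sum(1 for _ in part)]).collect()
    print(len(counts), max(counts), min(counts))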

