I have experienced some crashing behavior with join in pyspark. When I attempt a join with 2000 partitions in the result, the join succeeds, but when I use only 200 partitions in the result, the join fails with the message "Job aborted due to stage failure: Master removed our application: FAILED".
The crash always occurs at the beginning of the shuffle phase. Based on my observations, it seems like the workers in the read phase may be fetching entire blocks from the write phase of the shuffle rather than just the records necessary to compose the partition the reader is responsible for. Hence, when there are fewer partitions in the read phase, the worker is likely to need a record from each of the write partitions and consequently attempts to load the entire data set into the memory of a single machine (which then causes the out of memory crash I observe in /var/log/syslog).
Can anybody confirm if this is the behavior of pyspark? I am glad to supply additional details about my observed behavior upon request.