spark-dev mailing list archives

From Zach Fry <>
Subject Re: Spark shuffle consolidateFiles performance degradation numbers
Date Tue, 04 Nov 2014 02:34:41 GMT
Hey Andrew, Matei,

Thanks for responding.

For some more context, we were running into "Too many open files" issues
where we were seeing this happen immediately after the Collect phase
(about 30 seconds into a run) on a decently sized dataset (14 MM rows).
The ulimit set in spark-env was 256,000, which we believed should have
been enough, but even with it set at that number we were still hitting
the error.
Can you comment on what a "good" ulimit should be in these cases?

We believe this might have been caused by a process getting orphaned
without cleaning up its open file handles.
However, beyond anecdotal observations and some speculation, we don't
have much evidence to expand on this further.

We were wondering if we could get some more information about how many
files get opened during a shuffle.
We discussed that it is going to be around N x M, where N is the number of
Tasks and M is the number of Reducers.
Does this sound about right?
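To make the N x M estimate concrete, here is a minimal back-of-envelope sketch. The job sizes are hypothetical, and the consolidation formula assumes the roughly cores-times-reducers-per-executor behavior that consolidation is described as providing:

```python
# Back-of-envelope estimate of shuffle file counts for Spark's
# hash-based shuffle. The job sizes below are illustrative
# assumptions, not measurements.

def hash_shuffle_files(map_tasks, reducers):
    # Without consolidation: each map task writes one file per reducer.
    return map_tasks * reducers

def consolidated_shuffle_files(cores, reducers):
    # With spark.shuffle.consolidateFiles: concurrently running tasks
    # reuse file groups, so roughly one file per core per reducer
    # on each executor.
    return cores * reducers

# Hypothetical job: 2,000 map tasks, 500 reducers, 16 cores/executor.
print(hash_shuffle_files(2000, 500))        # 1,000,000 files total
print(consolidated_shuffle_files(16, 500))  # 8,000 files per executor
```

Even a generous ulimit like 256,000 can be exceeded quickly under the unconsolidated formula once task and reducer counts grow into the low thousands.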

Are there any other considerations we should be aware of when setting
consolidateFiles to True?

Zach Fry
Palantir | Developer Support Engineer <> | 650.226.6338

On 11/3/14 6:28 PM, "Matei Zaharia" <> wrote:

>In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will
>have better performance while creating fewer files. So I'd suggest trying
>that too.
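For context on why the sort-based shuffle creates fewer files: each map task writes a single sorted data file plus an index file, independent of the reducer count. A rough sketch under that assumption (job sizes are hypothetical):

```python
# Rough file-count comparison between Spark's hash-based shuffle
# (no consolidation) and the sort-based shuffle. Job sizes are
# illustrative assumptions.

def hash_shuffle_files(map_tasks, reducers):
    # Hash shuffle: one file per (map task, reducer) pair.
    return map_tasks * reducers

def sort_shuffle_files(map_tasks):
    # Sort-based shuffle: one sorted data file plus one index file
    # per map task, regardless of the number of reducers.
    return map_tasks * 2

print(hash_shuffle_files(2000, 500))  # 1,000,000
print(sort_shuffle_files(2000))       # 4,000
```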
>> On Nov 3, 2014, at 6:12 PM, Andrew Or <> wrote:
>> Hey Matt,
>> There's some prior work that compares consolidation performance on some
>> medium-scale workload:
>> There we noticed about 2x performance degradation in the reduce phase on
>> ext3. I am not aware of any other concrete numbers. Maybe others have
>> experiences to add.
>> -Andrew
>> 2014-11-03 17:26 GMT-08:00 Matt Cheah <>:
>>> Hi everyone,
>>> I'm running into more and more cases where too many files are opened
>>> when spark.shuffle.consolidateFiles is turned off.
>>> I was wondering if this is a common scenario among the rest of the
>>> community, and if so, whether it is worth turning the setting on
>>> by default. From the documentation, it seems like performance may
>>> be hurt on ext3 file systems. However, what are the concrete numbers
>>> of performance degradation typically seen? A 2x slowdown in the
>>> average job? 3x? Also, what causes the performance degradation on
>>> ext3 file systems specifically?
>>> Thanks,
>>> -Matt Cheah
