spark-dev mailing list archives

From Andrew Or <and...@databricks.com>
Subject Re: Spark shuffle consolidateFiles performance degradation numbers
Date Tue, 04 Nov 2014 02:12:34 GMT
Hey Matt,

There's some prior work that compares consolidation performance on a
medium-scale workload:
http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf

There we noticed about a 2x performance degradation in the reduce phase on
ext3. I am not aware of any other concrete numbers. Maybe others have more
experience to add.

-Andrew

2014-11-03 17:26 GMT-08:00 Matt Cheah <mcheah@palantir.com>:

> Hi everyone,
>
> I'm running into more and more cases where too many files are opened when
> spark.shuffle.consolidateFiles is turned off.
>
> I was wondering if this is a common scenario among the rest of the
> community, and if so, if it is worth considering the setting to be turned
> on by default. From the documentation, it seems like the performance could
> be hurt on ext3 file systems. However, what concrete performance
> degradation numbers are typically seen? A 2x slowdown in the
> average job? 3x? Also, what causes the performance degradation on ext3 file
> systems specifically?
>
> Thanks,
>
> -Matt Cheah
>
>
>
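
For context on the "too many files" symptom described above: with Spark's
hash-based shuffle, each map task writes one file per reduce partition, while
with spark.shuffle.consolidateFiles enabled those files are shared by the map
tasks that run on the same core. A rough sketch of that arithmetic follows.
The function names and the task, partition, and core counts are hypothetical
and not taken from this thread, and the formulas are an approximation of the
behaviour, not a measurement.

```python
# Rough estimate of shuffle file counts for Spark's hash-based shuffle.
# All names and numbers here are illustrative assumptions.

def shuffle_files_unconsolidated(map_tasks: int, reduce_partitions: int) -> int:
    """Without consolidation: one file per (map task, reduce partition) pair."""
    return map_tasks * reduce_partitions


def shuffle_files_consolidated(total_cores: int, reduce_partitions: int) -> int:
    """With consolidation: roughly one file per (core, reduce partition) pair."""
    return total_cores * reduce_partitions


if __name__ == "__main__":
    # Hypothetical job: 1000 map tasks, 200 reduce partitions, 16 cores in total.
    print(shuffle_files_unconsolidated(1000, 200))
    print(shuffle_files_consolidated(16, 200))
```

Under these assumptions the file count scales with the number of map tasks
when consolidation is off, and with the number of cores when it is on.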
