spark-dev mailing list archives

From Sandy Ryza <>
Subject Re: better compression codecs for shuffle blocks?
Date Mon, 14 Jul 2014 22:51:43 GMT
Often the shuffle is bound by writes to disk, so even if disks have enough
space to store the uncompressed data, the shuffle can complete faster by
writing less data.

This isn't a big help in the short term, but if we switch to a sort-based
shuffle, we'll only need a single LZFOutputStream per map task.
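
As a back-of-the-envelope illustration of why that matters (a sketch with assumed numbers; the 64 KB per-stream buffer size is an assumption for illustration, not a documented LZF default):

```python
# Rough compression-buffer overhead: hash-based shuffle opens one compressed
# stream per reducer partition, sort-based shuffle opens one per map task.
# The 64 KB buffer size below is an assumed figure, purely for illustration.
def shuffle_buffer_bytes(concurrent_tasks, streams_per_task, buf_bytes=64 * 1024):
    """Total compression-buffer memory across concurrently running map tasks."""
    return concurrent_tasks * streams_per_task * buf_bytes

# 8 concurrent map tasks on an executor, 1000 reduce partitions:
hash_based = shuffle_buffer_bytes(concurrent_tasks=8, streams_per_task=1000)

# Sort-based shuffle: a single output stream per running map task.
sort_based = shuffle_buffer_bytes(concurrent_tasks=8, streams_per_task=1)

print(hash_based // (1024 * 1024), "MB vs", sort_based // 1024, "KB")
# → 500 MB vs 512 KB
```

With those assumed numbers the buffer overhead drops by a factor equal to the number of reduce partitions, which is why a single stream per map task sidesteps the buffer problems discussed in this thread.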

On Mon, Jul 14, 2014 at 3:30 PM, Stephen Haberman <> wrote:

> Just a comment from the peanut gallery, but these buffers are a real
> PITA for us as well. Probably 75% of our non-user-error job failures
> are related to them.
> Just naively, what about not doing compression on the fly? E.g. during
> the shuffle just write straight to disk, uncompressed?
> For us, we always have plenty of disk space, and if you're concerned
> about network transmission, you could add a separate compress step
> after the blocks have been written to disk, but before being sent over
> the wire.
> Granted, IANAE, so perhaps this is a bad idea; either way, awesome to
> see work in this area!
> - Stephen
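
For what it's worth, Stephen's write-uncompressed-then-compress-before-the-wire idea could be sketched roughly like this (gzip stands in for whatever codec the shuffle would actually use, and `compress_block` is a hypothetical helper, not anything in Spark):

```python
import gzip
import os
import shutil

def compress_block(path):
    """Compress an already-written shuffle block after the fact.

    Sketch of the proposal above: write the block straight to disk
    uncompressed, then compress it in a separate step before it is
    sent over the network. gzip is a stand-in codec for illustration.
    """
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)  # drop the uncompressed original once compressed
    return gz_path
```

This trades extra disk I/O (the data is written once uncompressed and read back) for avoiding the per-stream compression buffers during the write phase, which is the trade-off being debated in this thread.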
