I have a map reduce job that reads from three logs and joins them on some key column. The underlying data is protobuf messages in sequence files. Between mappers and reducers, the underlying raw byte arrays for protobuf messages are shuffled . Roughly, for 1G input from HDFS, there is 2G data output from map phase.

I am testing spark jobs (v1.3.0) on the same input. I found that shuffle write is 3 - 4 times input size. I tried passing protobuf Message object and ArrayByte but neither gives good shuffle write output.

Is there any good practice on shuffling

* protobuf messages
* raw byte array

Chen