spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chen Song <>
Subject shuffle write size
Date Tue, 17 Mar 2015 21:23:13 GMT
I have a map reduce job that reads from three logs and joins them on some
key column. The underlying data is protobuf messages in sequence
files. Between mappers and reducers, the underlying raw byte arrays for
protobuf messages are shuffled . Roughly, for 1G input from HDFS, there is
2G data output from map phase.

I am testing spark jobs (v1.3.0) on the same input. I found that shuffle
write is 3 - 4 times input size. I tried passing protobuf Message object
and ArrayByte but neither gives good shuffle write output.

Is there any good practice on shuffling

* protobuf messages
* raw byte array


View raw message