spark-user mailing list archives

From YANG Fan
Subject dealing with large values in kv pairs
Date Mon, 10 Nov 2014 08:34:22 GMT

I have a very large list of key-value pairs, where each key is an integer
and each value is a long string (around 1 KB). I want to concatenate the
strings that share the same key.

Initially I did something like: pairs.reduceByKey((a, b) => a+" "+b)
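
For concreteness, here is a minimal self-contained sketch of that step. The
object name ConcatByKey, the helper concatByKey, the local master setting,
the three-element sample data and the commented-out output path are all
assumptions made for illustration; the real job runs on a cluster against a
much larger dataset.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; only needed before Spark 1.3
import org.apache.spark.rdd.RDD

object ConcatByKey {
  // Concatenate all string values that share a key, separated by a space.
  // The order of the values within a key is not guaranteed.
  def concatByKey(pairs: RDD[(Int, String)]): RDD[(Int, String)] =
    pairs.reduceByKey((a, b) => a + " " + b)

  def main(args: Array[String]): Unit = {
    // Assumption: local mode and a tiny in-memory sample, for illustration only.
    val conf = new SparkConf().setAppName("ConcatByKey").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (1, "c")))
      concatByKey(pairs).collect().foreach(println)
      // On the cluster the result is saved instead of collected, e.g.
      // concatByKey(pairs).saveAsTextFile("hdfs:///hypothetical/output/path")
    } finally {
      sc.stop()
    }
  }
}
```

Each merge in this form allocates a new string holding both inputs, so the
per-key value keeps growing as it moves through the reduce.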

I then tried to save the result to HDFS, but it was extremely slow, and in
the end I had to kill the job.

My guess is that the values are too large and slow down the shuffle phase,
so I tried calling sortByKey before reduceByKey. sortByKey on its own is
very fast, and writing its result back to HDFS is also fast. But once I
added reduceByKey, the job was as slow as before.

How can I make this simple operation faster?

