spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: dealing with large values in kv pairs
Date Mon, 10 Nov 2014 08:40:58 GMT
Are you suggesting that the String concatenation is slow? It probably is,
because every a + " " + b allocates a new String and copies both operands
into it.

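Consider foldByKey instead, starting with an empty StringBuilder as its
zero value. Appending to a builder avoids recopying the accumulated text on
every merge, so it builds up the result far more efficiently. (foldByKey
needs a zero value of the same type as the values, so in practice this means
either mapping the values to StringBuilder first or using aggregateByKey,
which accepts a separate accumulator type.)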
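
A rough sketch of the StringBuilder idea, in case it helps. It uses
aggregateByKey rather than foldByKey, because the accumulator here
(a java.lang.StringBuilder) has a different type from the String values.
The object name ConcatByKey, the helper concatByKey, and the local[2]
master are illustrative assumptions, not anything from the original job. It
also assumes Spark 1.3 or later; older releases would additionally need
import org.apache.spark.SparkContext._ for the pair-RDD methods.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, for illustration only.
object ConcatByKey {

  // Concatenate all values sharing a key, separated by single spaces.
  // As with reduceByKey, the order of values within a key is not guaranteed.
  def concatByKey(pairs: RDD[(Int, String)]): RDD[(Int, String)] =
    pairs
      .aggregateByKey(new java.lang.StringBuilder)(
        // seqOp: append one value to this key's builder within a partition
        (sb, s) => {
          if (sb.length() > 0) sb.append(' ')
          sb.append(s)
        },
        // combOp: merge two builders for the same key after the shuffle
        (a, b) => {
          if (a.length() > 0 && b.length() > 0) a.append(' ')
          a.append(b)
        }
      )
      // Materialize each String once, at the end
      .mapValues(_.toString)

  def main(args: Array[String]): Unit = {
    // Local master assumed purely so the sketch can be tried on one machine
    val conf = new SparkConf().setAppName("concat-by-key").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (1, "c")))
      concatByKey(pairs).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Spark gives each key its own copy of the zero value, so mutating the builder
inside the two functions should be safe. Note that this only reduces the
allocation and copying cost; the same number of bytes per key still has to
cross the shuffle.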
On Nov 10, 2014 8:37 AM, "YANG Fan" <iddmbr@gmail.com> wrote:

> Hi,
>
> I've got a huge list of key-value pairs, where the key is an integer and
> the value is a long string (around 1 KB). I want to concatenate the strings
> with the same keys.
>
> Initially I did something like: pairs.reduceByKey((a, b) => a+" "+b)
>
> Then I tried to save the result to HDFS, but it was extremely slow. In the
> end I had to kill the job.
>
> I guess it's because the value part is too big and it slows down the
> shuffling phase. So I tried to use sortByKey before doing reduceByKey.
> sortByKey is very fast, and it's also fast when writing the result back to
> HDFS. But when I did reduceByKey, it was as slow as before.
>
> How can I make this simple operation faster?
>
> Thanks,
> Fan
>
