spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: *ByKey aggregations: performance + order
Date Thu, 15 Jan 2015 09:19:46 GMT
I'm interested too. I don't know for sure, but I don't think this case is
optimized this way. However, if you know your keys aren't split across
partitions and your partitions are small enough to hold in memory, you can
implement the same grouping with mapPartitions and Scala.
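
A rough sketch of what that per-partition grouping could look like, in plain
Scala so it runs without a cluster. The names PartitionGrouping,
groupWithinPartition and pairs are made up for illustration. It assumes every
occurrence of a key sits in a single partition and that each partition fits in
memory; if either assumption fails, the result will not match groupByKey.

```scala
import scala.collection.mutable

object PartitionGrouping {
  // Groups the (key, value) pairs of ONE partition in memory.
  // Assumption: all values for a given key are in this partition,
  // and the whole partition fits in memory.
  def groupWithinPartition[K, V](iter: Iterator[(K, V)]): Iterator[(K, List[V])] = {
    // LinkedHashMap keeps keys in first-seen order; each buffer keeps
    // its values in the order they were encountered.
    val groups = mutable.LinkedHashMap.empty[K, mutable.ArrayBuffer[V]]
    iter.foreach { case (k, v) =>
      groups.getOrElseUpdate(k, mutable.ArrayBuffer.empty[V]) += v
    }
    groups.iterator.map { case (k, vs) => (k, vs.toList) }
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for the contents of a single partition.
    val partition = Iterator("a" -> 1, "b" -> 2, "a" -> 3)
    groupWithinPartition(partition).foreach(println)
    // prints (a,List(1, 3)) then (b,List(2))
  }
}

// With Spark on the classpath and a hypothetical pairs: RDD[(K, V)],
// the function would be applied per partition, with no shuffle:
//   val grouped = pairs.mapPartitions(
//     it => PartitionGrouping.groupWithinPartition(it),
//     preservesPartitioning = true)
```

Because nothing is shuffled, values stay in the order they appear within the
partition, which groupByKey does not promise.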
On Jan 15, 2015 1:27 AM, "Tobias Pfeiffer" <tgp@preferred.jp> wrote:

> Sean,
>
> thanks for your message.
>
> On Wed, Jan 14, 2015 at 8:36 PM, Sean Owen <sowen@cloudera.com> wrote:
>
>> On Wed, Jan 14, 2015 at 4:53 AM, Tobias Pfeiffer <tgp@preferred.jp> wrote:
>> > OK, it seems like even on a local machine (with no network overhead), the
>> > groupByKey version is about 5 times slower than any of the other
>> > (reduceByKey, combineByKey etc.) functions...
>>
>> Even without network overhead, you're still paying the cost of setting
>> up the shuffle and serialization.
>>
>
> Can I pick an appropriate scheduler some time before so that Spark "knows"
> all items with the same key are on the same host? (Or enforce this?)
>
> Thanks
> Tobias
>
>
>
