spark-user mailing list archives

From Tobias Pfeiffer <...@preferred.jp>
Subject Re: *ByKey aggregations: performance + order
Date Thu, 15 Jan 2015 01:27:17 GMT
Sean,

thanks for your message.

On Wed, Jan 14, 2015 at 8:36 PM, Sean Owen <sowen@cloudera.com> wrote:

> On Wed, Jan 14, 2015 at 4:53 AM, Tobias Pfeiffer <tgp@preferred.jp> wrote:
> > OK, it seems like even on a local machine (with no network overhead), the
> > groupByKey version is about 5 times slower than any of the other
> > (reduceByKey, combineByKey etc.) functions...
>
> Even without network overhead, you're still paying the cost of setting
> up the shuffle and serialization.
>

Can I pick an appropriate scheduler beforehand so that Spark "knows" that
all items with the same key are on the same host? (Or can I enforce this?)

Thanks
Tobias
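
The idea behind the question above (co-locating all items that share a key before aggregating) can be illustrated with a minimal sketch in plain Python. This is not Spark code and not an answer given in the thread: the function names `partition_by_key` and `sum_within_partition` are hypothetical, and the sample records are invented for illustration. In Spark's RDD API the roughly analogous step is `partitionBy` with a `HashPartitioner`; whether that removes the cost measured above is not established here.

```python
from collections import defaultdict


def partition_by_key(records, num_partitions):
    """Place every (key, value) pair in the partition chosen by hash(key)."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def sum_within_partition(partition):
    """Aggregate one partition on its own, without data from other partitions."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


# Invented sample data, for illustration only.
records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

partitions = partition_by_key(records, num_partitions=3)
per_partition = [sum_within_partition(p) for p in partitions]

# Every key lives in exactly one partition, so merging the local results
# never has to combine two partial sums for the same key.
result = {}
for local in per_partition:
    result.update(local)
```

Because the same hash function decides the placement of every record, all values for a given key end up in one partition, and each partition can then be aggregated independently.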
