spark-user mailing list archives

From Andy Dang <nam...@gmail.com>
Subject Re: groupByKey vs mapPartitions for efficient grouping within a Partition
Date Mon, 16 Jan 2017 16:58:15 GMT
groupByKey() is a wide dependency and will cause a full shuffle. Using
this transformation is discouraged unless your keys are balanced
(well-distributed) and you need a full shuffle.
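
One caveat on the shuffle, as far as I understand the RDD API: if the RDD
already carries the same partitioner that groupByKey() is asked to use
(for example after partitionBy()), the grouping runs inside the existing
partitions and no new shuffle is added. Below is a minimal sketch of how
one might check that, next to a hand-written mapPartitions() version. The
object name, helper names, sample data and local session are made up for
illustration, and it assumes Spark 2.x on the classpath.

```scala
import org.apache.spark.{HashPartitioner, ShuffleDependency}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object PartitionLocalGrouping {
  // True if the RDD's direct parent link is a shuffle (wide) dependency.
  def shufflesDirectly(rdd: RDD[_]): Boolean =
    rdd.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

  // Groups inside each partition only. This is correct only if every key
  // lives in exactly one partition, and it materializes the whole
  // partition in memory before grouping.
  def groupWithinPartitions(pairs: RDD[(String, Int)]): RDD[(String, Vector[Int])] =
    pairs.mapPartitions(
      iter => iter.toVector.groupBy(_._1).iterator.map {
        case (key, kvs) => (key, kvs.map(_._2))
      },
      preservesPartitioning = true)

  def main(args: Array[String]): Unit = {
    // Hypothetical local session and sample data, for illustration only.
    val spark = SparkSession.builder()
      .master("local[2]").appName("partition-local-grouping").getOrCreate()
    val sc = spark.sparkContext

    val partitioner = new HashPartitioner(2)
    val raw = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)), 2)
    val partitioned = raw.partitionBy(partitioner) // this step shuffles once

    // raw has no partitioner, so a shuffle is expected here.
    println(shufflesDirectly(raw.groupByKey(partitioner)))
    // Same partitioner already in place, so no new shuffle is expected.
    println(shufflesDirectly(partitioned.groupByKey(partitioner)))

    groupWithinPartitions(partitioned).collect().foreach(println)
    spark.stop()
  }
}
```

On memory: the mapPartitions() version above holds an entire partition in
memory at once, so it is only a reasonable choice when partitions are
known to be small enough.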

Otherwise, what you want is aggregateByKey() or reduceByKey() (depending on
the output). These transformations are backed by combineByKey(), which can
perform map-side aggregation.
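
As a rough sketch of the two (assuming Spark 2.x and the Scala RDD API;
the object name, helper names and sample data are hypothetical):
reduceByKey() fits when the result has the same type as the values, and
aggregateByKey() fits when the result type differs, such as a (sum, count)
pair used to compute an average. In both cases values are combined within
each partition before anything is shuffled.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object MapSideAggregation {
  // Result type equals the value type, so reduceByKey is enough.
  def sumByKey(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
    pairs.reduceByKey(_ + _)

  // Result type (sum, count) differs from the value type, so use
  // aggregateByKey with a zero value, a per-partition fold and a merge.
  def averageByKey(pairs: RDD[(String, Int)]): RDD[(String, Double)] =
    pairs
      .aggregateByKey((0, 0))(
        (acc, v) => (acc._1 + v, acc._2 + 1),   // fold a value into (sum, count)
        (x, y) => (x._1 + y._1, x._2 + y._2))   // merge partial (sum, count) pairs
      .mapValues { case (sum, count) => sum.toDouble / count }

  def main(args: Array[String]): Unit = {
    // Hypothetical local session and sample data, for illustration only.
    val spark = SparkSession.builder()
      .master("local[2]").appName("map-side-aggregation").getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(
      Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)), 2)

    sumByKey(pairs).collect().foreach(println)
    averageByKey(pairs).collect().foreach(println)
    spark.stop()
  }
}
```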

-------
Regards,
Andy

On Mon, Jan 16, 2017 at 2:21 PM, Patrick <titlibatali@gmail.com> wrote:

> Hi,
>
> Does groupByKey have any intelligence built in, such that if all the
> keys reside in the same partition, it does not do the shuffle?
>
> Or should the user write mapPartitions (with Scala groupBy code)?
>
> Which would be more efficient and what are the memory considerations?
>
> Thanks
