spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andy Dang <>
Subject Re: groupByKey vs mapPartitions for efficient grouping within a Partition
Date Mon, 16 Jan 2017 16:58:15 GMT
groupByKey() is a wide dependency and will cause a full shuffle. It's
advised against using this transformation unless you keys are balanced
(well-distributed) and you need a full shuffle.

Otherwise, what you want is aggregateByKey() or reduceByKey() (depending on
the output). These actions are backed by comineByKey(), which can perform
map-side aggregation.


On Mon, Jan 16, 2017 at 2:21 PM, Patrick <> wrote:

> Hi,
> Does groupByKey has intelligence associated with it, such that if all the
> keys resides in the same partition, it should not do the shuffle?
> Or user should write mapPartitions( scala groupBy code).
> Which would be more efficient and what are the memory considerations?
> Thanks

View raw message