spark-dev mailing list archives

From Sean Owen
Subject Re: groupByKey() and keys with many values
Date Tue, 08 Sep 2015 07:53:00 GMT
I think groupByKey is intended for cases where you do want the values
in memory; for one-pass use cases, it's more efficient to use
reduceByKey, or aggregateByKey if lower-level operations are needed.
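
To make the one-pass point concrete, here is a rough plain-Python sketch (not Spark code) of what reducing or aggregating by key amounts to: each value is folded into a single accumulator per key, so the full list of values for a key is never held in memory. The helper names `reduce_by_key` and `aggregate_by_key` are made up for illustration; in PySpark the corresponding calls would be `rdd.reduceByKey(func)` and `rdd.aggregateByKey(zeroValue, seqFunc, combFunc)`, which also merge partial results across partitions.

```python
from operator import add

def reduce_by_key(pairs, func):
    """Fold values per key one at a time, keeping one accumulator per key."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc

def aggregate_by_key(pairs, zero, seq_op):
    """Like reduce_by_key, but the accumulator may have a different type than the values."""
    acc = {}
    for key, value in pairs:
        acc[key] = seq_op(acc.get(key, zero), value)
    return acc

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

# Sum per key; the iterator is consumed in a single pass.
sums = reduce_by_key(iter(pairs), add)

# (count, total) per key, starting from an immutable zero value.
stats = aggregate_by_key(iter(pairs), (0, 0), lambda a, v: (a[0] + 1, a[1] + v))
```

With groupByKey, by contrast, all values for a key have to be collected before you can touch them, which is what hurts when a key has many values.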

For your case, you probably want to do your reduceByKey, then perform
the expensive per-key lookups once per key. You also probably want to
do this in foreachPartition, not foreach, in order to pay DB
connection costs just once per partition.

On Tue, Sep 8, 2015 at 7:20 AM, kaklakariada wrote:
> Hi Antonio!
> Thank you very much for your answer!
> You are right in that in my case the computation could be replaced by a
> reduceByKey. The thing is that my computation also involves database
> queries:
> 1. Fetch key-specific data from database into memory. This is expensive and
> I only want to do this once for a key.
> 2. Process each value using this data and update the common data
> 3. Store modified data to database. Here it is important to write all data
> for a key in one go.
> Is there a pattern how to implement something like this with reduceByKey?
> Out of curiosity: I understand why you want to discourage people from using
> groupByKey. But is there a technical reason why the Iterable is implemented
> the way it is?
> Kind regards,
> Christoph.
