spark-dev mailing list archives

From Antonio Piccolboni <>
Subject Re: groupByKey() and keys with many values
Date Tue, 08 Sep 2015 16:51:55 GMT
You may also consider selecting the distinct keys, fetching their data from
the database first, and then joining with the values on key. This helps
when Sean's approach is not viable -- when you need the DB data before the
first reduce call. By not revealing your problem, you are forcing us to
make guesses, which are less useful. Imagine you want to compute a binning
of the values on a per-key basis, with the bin definitions stored in the
database; the reduce would then update counts per bin. You could let the
reduce initialize the bin counts from the DB when empty, but this results
in multiple database accesses and connections per key, and the higher the
degree of parallelism, the bigger the cost (see this
<> elementary
example) -- something you should avoid if you want to write code with some
durability to it. With the join approach instead, you select the keys,
unique them, and perform a single database access to obtain the bin
definitions. You then join the data with the bin definitions on key and
pass the result through a reduceByKey to update the bin counts. Different
application: you want to compute max/min values per key, compare with the
previously recorded max/min, and store the overall max/min. Then you don't
need the database values during the reduce at all; you just fetch them in
the foreachPartition, before each write.
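To make the join pattern concrete, here is a minimal sketch in plain
Python, with dicts standing in for the RDDs and a stubbed fetch_bin_defs in
place of the real database call (all the names here are illustrative, not
Spark API):

```python
from collections import defaultdict

# Hypothetical one-shot DB lookup, stubbed: pretend the database
# returns the same bin edges for every key.
def fetch_bin_defs(keys):
    return {k: [0, 10, 20] for k in keys}

def bin_index(edges, v):
    # index of the bin that v falls into; the last bin catches the rest
    for i in range(len(edges) - 1):
        if edges[i] <= v < edges[i + 1]:
            return i
    return len(edges) - 1

data = [("a", 3), ("a", 15), ("b", 7), ("b", 25)]

# 1. select the keys, unique them, one database access in total
bin_defs = fetch_bin_defs({k for k, _ in data})

# 2. "join" each value with its key's bin definitions, then
#    "reduceByKey" to update the per-bin counts
counts = defaultdict(lambda: defaultdict(int))
for k, v in data:
    counts[k][bin_index(bin_defs[k], v)] += 1
```

The point is that the database is touched once, up front, instead of once
per key inside the reduce.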

As for the DB writes, remember Spark can retry a computation, so your
writes have to be idempotent (see this thread
<!topic/spark-users/oM-IzQs0Z2s>, in which
Reynold is a bit more optimistic about failures than I am comfortable with,
but who am I to question Reynold?)
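Idempotent here just means that running the write twice leaves the same
state as running it once, e.g. an upsert keyed by a deterministic id rather
than an append. A toy Python sketch (store and write_result are made-up
names standing in for the real database):

```python
# A keyed upsert: a retried task overwrites the same row instead of
# appending a duplicate.
store = {}

def write_result(key, value):
    store[key] = value          # PUT semantics, not INSERT

for attempt in range(2):        # simulate a retried Spark task
    write_result(("a", "max"), 42)

assert len(store) == 1          # no duplicate rows after the retry
```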

On Tue, Sep 8, 2015 at 12:53 AM Sean Owen <> wrote:

> I think groupByKey is intended for cases where you do want the values
> in memory; for one-pass use cases, it's more efficient to use
> reduceByKey, or aggregateByKey if lower-level operations are needed.
> For your case, you probably want to do your reduceByKey, then perform
> the expensive per-key lookups once per key. You also probably want to
> do this in foreachPartition, not foreach, in order to pay DB
> connection costs just once per partition.
> On Tue, Sep 8, 2015 at 7:20 AM, kaklakariada <>
> wrote:
> > Hi Antonio!
> >
> > Thank you very much for your answer!
> > You are right in that in my case the computation could be replaced by a
> > reduceByKey. The thing is that my computation also involves database
> > queries:
> >
> > 1. Fetch key-specific data from database into memory. This is expensive
> and
> > I only want to do this once for a key.
> > 2. Process each value using this data and update the common data
> > 3. Store modified data to database. Here it is important to write all
> data
> > for a key in one go.
> >
> > Is there a pattern how to implement something like this with reduceByKey?
> >
> > Out of curiosity: I understand why you want to discourage people from
> using
> > groupByKey. But is there a technical reason why the Iterable is
> implemented
> > the way it is?
> >
> > Kind regards,
> > Christoph.
> >
> >
> >
> > --
> > View this message in context:
> > Sent from the Apache Spark Developers List mailing list archive at
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail:
> > For additional commands, e-mail:
> >
