spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tobias Pfeiffer <...@preferred.jp>
Subject *ByKey aggregations: performance + order
Date Wed, 14 Jan 2015 03:11:34 GMT
Hi,

I have an RDD[(Long, MyData)] where I want to compute various functions on
lists of MyData items with the same key (this will in general be a rather
short lists, around 10 items per key).

Naturally I was thinking of groupByKey() but was a bit intimidated by the
warning: "This operation may be very expensive. If you are grouping in
order to perform an aggregation (such as a sum or average) over each key,
using PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey will
provide much better performance."

Now I don't know (yet) if all of the functions I want to compute can be
expressed in this way and I was wondering about *how much* more expensive
we are talking about. Say I have something like

  rdd.zipWithIndex.map(kv => (kv._2/10, kv._1)).groupByKey(),

i.e. items that will be grouped will 99% live in the same partition (do
they?), does this change the performance?

Also, if my operations depend on the order in the original RDD (say, string
concatenation), how could I make sure the order of the original RDD is
respected?

Thanks
Tobias

Mime
View raw message