spark-dev mailing list archives

From Etienne Chauchot <echauc...@apache.org>
Subject Re: CombinePerKey and GroupByKey
Date Fri, 01 Mar 2019 09:00:22 GMT
That's good to know
Thanks
Etienne
On Thursday, February 28, 2019 at 10:05 -0800, Reynold Xin wrote:
> This should be fine. Dataset.groupByKey is a logical operation, not a physical one
> (as in, Spark wouldn’t always materialize all the groups in memory).
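
A minimal, runnable Java sketch of the point Reynold is making (class and variable names such as GroupByKeySketch and the toy SUM aggregator are illustrative, not from the thread): a typed Aggregator applied after Dataset.groupByKey, the same pattern as the snippet quoted below. Because groupByKey is only logical, explain() on the result should show a partial aggregate before the shuffle (Exchange) and a final aggregate after it.

  // Illustrative sketch only (names are mine, not from the thread).
  import org.apache.spark.api.java.function.MapFunction;
  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Encoder;
  import org.apache.spark.sql.Encoders;
  import org.apache.spark.sql.SparkSession;
  import org.apache.spark.sql.expressions.Aggregator;
  import scala.Tuple2;

  public class GroupByKeySketch {

    // Toy typed aggregator: sums the value half of (String, Long) pairs.
    static final Aggregator<Tuple2<String, Long>, Long, Long> SUM =
        new Aggregator<Tuple2<String, Long>, Long, Long>() {
          @Override public Long zero() { return 0L; }
          @Override public Long reduce(Long acc, Tuple2<String, Long> kv) { return acc + kv._2(); }
          @Override public Long merge(Long a, Long b) { return a + b; }
          @Override public Long finish(Long acc) { return acc; }
          @Override public Encoder<Long> bufferEncoder() { return Encoders.LONG(); }
          @Override public Encoder<Long> outputEncoder() { return Encoders.LONG(); }
        };

    public static void main(String[] args) {
      SparkSession spark = SparkSession.builder()
          .master("local[*]").appName("groupByKey-sketch").getOrCreate();

      // Toy (key, value) pairs spread across three keys.
      Dataset<Tuple2<String, Long>> pairs = spark.range(100).map(
          (MapFunction<Long, Tuple2<String, Long>>) i -> new Tuple2<>("k" + (i % 3), i),
          Encoders.tuple(Encoders.STRING(), Encoders.LONG()));

      // Logical grouping followed by a typed aggregation.
      Dataset<Tuple2<String, Long>> sums = pairs
          .groupByKey((MapFunction<Tuple2<String, Long>, String>) t -> t._1(), Encoders.STRING())
          .agg(SUM.toColumn().name("sum"));

      sums.explain();  // inspect the physical plan for partial/final aggregation
      spark.stop();
    }
  }
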
> On Thu, Feb 28, 2019 at 1:46 AM Etienne Chauchot <echauchot@apache.org> wrote:
> > Hi all,
> > 
> > I'm migrating RDD pipelines to Dataset and I saw that Combine.PerKey is no longer
> > available in the Dataset API. So I translated it to:
> > 
> > 
> > KeyValueGroupedDataset<K, KV<K, InputT>> groupedDataset =
> >     keyedDataset.groupByKey(KVHelpers.extractKey(), EncoderHelpers.genericEncoder());
> > 
> > Dataset<Tuple2<K, OutputT>> combinedDataset =
> >     groupedDataset.agg(
> >         new Aggregator<KV<K, InputT>, AccumT, OutputT>(combineFn).toColumn());
> > 
> > I have a question regarding performance: since GroupByKey is generally less
> > performant (it entails a shuffle, and a possible OOM if a given key has a lot
> > of data associated with it), I was wondering whether the new Spark optimizer
> > translates such a DAG into a combinePerKey behind the scenes. In other words,
> > is such a DAG going to be translated into a local (or partial, whichever term
> > you use) combine followed by a global combine, to reduce the data shuffled?
> > (A sketch of this local/global distinction follows the quoted thread.)
> > 
> > Thanks
> > Etienne
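
Since the question above contrasts GroupByKey with a local-plus-global combine, here is a small, self-contained Java sketch of that distinction on the RDD API, where the choice is explicit (toy data and class names are mine, not from the thread): groupByKey shuffles every record, while aggregateByKey declares the local and global merge functions so only partial results per key per partition cross the network. On Datasets, per Reynold's reply above, the grouping is logical and the planner can make this choice when the aggregation is applied.

  // Illustrative sketch only (toy data, names are mine).
  import java.util.Arrays;
  import org.apache.spark.api.java.JavaPairRDD;
  import org.apache.spark.api.java.JavaSparkContext;
  import scala.Tuple2;

  public class CombinePerKeySketch {
    public static void main(String[] args) {
      JavaSparkContext sc = new JavaSparkContext("local[*]", "combine-sketch");

      JavaPairRDD<String, Long> pairs = sc.parallelizePairs(Arrays.asList(
          new Tuple2<>("a", 1L), new Tuple2<>("b", 2L), new Tuple2<>("a", 3L)));

      // groupByKey: every record crosses the shuffle, and all values for a
      // hot key land in one partition (the OOM risk mentioned above).
      JavaPairRDD<String, Iterable<Long>> grouped = pairs.groupByKey();

      // aggregateByKey: the first function combines locally within each
      // partition (the "local"/"partial" combine), the second merges the
      // partial results across partitions (the "global" combine), so only
      // one partial value per key per partition is shuffled.
      JavaPairRDD<String, Long> sums =
          pairs.aggregateByKey(0L, (acc, v) -> acc + v, (a, b) -> a + b);

      System.out.println(sums.collectAsMap());  // per-key sums, e.g. {a=4, b=2}
      sc.stop();
    }
  }
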
