spark-dev mailing list archives

From Reynold Xin <r...@databricks.com>
Subject Re: Dataset and Aggregator API pain points
Date Sun, 03 Jul 2016 01:35:37 GMT
Thanks, Koert, for the great email. They are all great points.

We should probably create an umbrella JIRA for easier tracking.

On Saturday, July 2, 2016, Koert Kuipers <koert@tresata.com> wrote:

> after working with the Dataset and Aggregator APIs for a few weeks porting
> some fairly complex RDD algos (an overall pleasant experience), i wanted to
> summarize the pain points and some suggestions for improvement given my
> experience. all of these are already mentioned on the mailing list or in
> jira, but i figured it's good to put them in one place.
> see below.
> best,
> koert
>
> *) a lot of practical aggregation functions do not have a zero. this can
> be dealt with correctly by using null or None as the zero for the
> Aggregator. in algebird, for example, this is expressed by lifting an
> algebird.Aggregator (which does not have a zero) into an
> algebird.MonoidAggregator (which does have a zero, so similar to spark's
> Aggregator). see:
>
> https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Aggregator.scala#L420
> something similar should be possible in spark. however, Aggregator
> currently does not accept null or an Option as its zero, making this
> approach difficult. see:
> https://www.mail-archive.com/user@spark.apache.org/msg53106.html
> https://issues.apache.org/jira/browse/SPARK-15810
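The lifting Koert describes can be sketched in plain Scala. The trait below mirrors the shape of Spark's Aggregator[IN, BUF, OUT] minus its Encoder members, purely for illustration; the names are not Spark's API:

```scala
// Illustrative sketch (not Spark's API): Spark's Aggregator shape
// without encoders, showing how None can serve as the zero for a
// combine function that has no zero of its own.
trait Agg[IN, BUF, OUT] {
  def zero: BUF
  def reduce(b: BUF, a: IN): BUF
  def merge(b1: BUF, b2: BUF): BUF
  def finish(b: BUF): OUT
}

// Lift a zero-less combine (max, first, argmax, ...) into an aggregator
// by adjoining None as the identity element, the same move algebird makes
// when lifting an Aggregator into a MonoidAggregator.
def lift[A](combine: (A, A) => A): Agg[A, Option[A], Option[A]] =
  new Agg[A, Option[A], Option[A]] {
    def zero: Option[A] = None
    def reduce(b: Option[A], a: A): Option[A] =
      b.map(combine(_, a)).orElse(Some(a))
    def merge(b1: Option[A], b2: Option[A]): Option[A] = (b1, b2) match {
      case (Some(x), Some(y)) => Some(combine(x, y))
      case _                  => b1.orElse(b2)
    }
    def finish(b: Option[A]): Option[A] = b
  }

// Example: max over Int has no sensible zero, but lifts cleanly.
val maxAgg = lift[Int](math.max)
val result = maxAgg.finish(List(3, 1, 4).foldLeft(maxAgg.zero)(maxAgg.reduce))
// result == Some(4); an empty input folds to None rather than failing
```

This is exactly the pattern that a null- or Option-friendly zero in Spark's Aggregator would enable.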
>
> *) KeyValueGroupedDataset.reduceGroups needs to be efficient, probably
> using an Aggregator (with null or None as the zero) under the hood. the
> current implementation does a flatMapGroups, which is suboptimal since it
> cannot do partial (map-side) aggregation.
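The efficiency gap can be seen in miniature: given a zero and a merge, a reduce can fold each partition locally and then merge the partial results, which a flatMapGroups-based implementation cannot do map-side. A plain-Scala simulation, with None standing in for the missing zero (all names illustrative):

```scala
// Simulate partial (map-side) aggregation of one group whose values are
// spread across partitions: fold each partition locally, then merge the
// per-partition partial results.
def reducePartitioned[A](partitions: Seq[Seq[A]])(f: (A, A) => A): Option[A] = {
  // local fold per partition, using None as the adjoined zero
  val partials: Seq[Option[A]] = partitions.map { part =>
    part.foldLeft(Option.empty[A]) { (b, a) => b.map(f(_, a)).orElse(Some(a)) }
  }
  // merge the partials; empty partitions contribute None and are no-ops
  partials.foldLeft(Option.empty[A]) {
    case (Some(x), Some(y)) => Some(f(x, y))
    case (b, p)             => b.orElse(p)
  }
}

val total = reducePartitioned(Seq(Seq(1, 2), Seq(3), Seq()))(_ + _)
// total == Some(6)
```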
>
> *) KeyValueGroupedDataset needs mapValues. without it, porting many algos
> from RDD to Dataset is difficult and clumsy. see:
> https://issues.apache.org/jira/browse/SPARK-15780
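What SPARK-15780 asks for, roughly, is a value transform that leaves the grouping key alone, analogous to mapValues on RDD pairs. A plain-Scala sketch of the shape, with grouped data modeled as a Map (names illustrative, not Spark's API):

```scala
// mapValues: transform the values of grouped data without re-keying.
// Without it on KeyValueGroupedDataset, a port must thread the key
// through every downstream closure or re-group after a map.
def mapValues[K, V, W](grouped: Map[K, Seq[V]])(f: V => W): Map[K, Seq[W]] =
  grouped.map { case (k, vs) => k -> vs.map(f) }

val byKey   = Map("a" -> Seq(1, 2), "b" -> Seq(3))
val doubled = mapValues(byKey)(_ * 2)
// doubled == Map("a" -> Seq(2, 4), "b" -> Seq(6))
```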
>
> *) Aggregators also need to work within DataFrames (so with
> RelationalGroupedDataset) without having to fall back on Row objects as
> input. otherwise all the aggregation logic ends up being written twice,
> once as an Aggregator and once as a UserDefinedAggregateFunction/UDAF.
> this doesn't make sense to me. my attempt at addressing this:
> https://issues.apache.org/jira/browse/SPARK-15769
> https://github.com/apache/spark/pull/13512
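One way to write the logic only once is to make the aggregator's input type adaptable: define it over a domain type, then contramap the input at the boundary (for the DataFrame side, a Row-to-domain extraction). This is only an illustration of the idea; SPARK-15769 / the PR above tackle it inside Spark itself. Minimal plain-Scala sketch (trait and names illustrative):

```scala
// Minimal typed-aggregator shape (Spark's Aggregator minus encoders).
trait Agg[IN, BUF, OUT] { self =>
  def zero: BUF
  def reduce(b: BUF, a: IN): BUF
  def merge(b1: BUF, b2: BUF): BUF
  def finish(b: BUF): OUT

  // Adapt the input type: the same logic can serve typed input and,
  // given an extraction function, untyped Row-like input.
  def contramap[IN2](f: IN2 => IN): Agg[IN2, BUF, OUT] =
    new Agg[IN2, BUF, OUT] {
      def zero: BUF = self.zero
      def reduce(b: BUF, a: IN2): BUF = self.reduce(b, f(a))
      def merge(b1: BUF, b2: BUF): BUF = self.merge(b1, b2)
      def finish(b: BUF): OUT = self.finish(b)
    }
}

// Written once over Int...
val sum = new Agg[Int, Long, Long] {
  def zero: Long = 0L
  def reduce(b: Long, a: Int): Long = b + a
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(b: Long): Long = b
}

// ...and adapted to (name, amount) "rows" without a second implementation.
val rowSum = sum.contramap[(String, Int)](_._2)
val total  = rowSum.finish(
  List(("a", 1), ("b", 2)).foldLeft(rowSum.zero)(rowSum.reduce))
// total == 3L
```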
>
> best,
> koert
>
>
