spark-user mailing list archives

From Lee Becker <lee.bec...@hapara.com>
Subject countDistinct, partial aggregates and Spark 2.0
Date Fri, 12 Aug 2016 17:55:14 GMT
Hi everyone,

I've started experimenting with my codebase to see how much work it will
take to port it from 1.6.1 to 2.0.0.  While regression-testing some of my
DataFrame transforms, I've discovered that I can no longer pair a
countDistinct with a collect_set in the same aggregation.

Consider:

val df = sc.parallelize(Array(("a", "a"), ("b", "c"), ("c", "a"))).toDF("x", "y")
val grouped = df.groupBy($"x").agg(countDistinct($"y"), collect_set($"y"))

When it comes time to execute (via collect or show), I get the following
error:

java.lang.RuntimeException: Distinct columns cannot exist in Aggregate
operator containing aggregate functions which don't support partial
aggregation.


I never encountered this behavior in previous Spark versions.  Are there
workarounds that don't require computing each aggregation separately and
joining later?  Is there a partial aggregation version of collect_set?
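
One possible workaround, offered as a sketch rather than a confirmed fix: collect_set already returns the distinct non-null values of a group, so the size of that set should match what countDistinct would report. The example below assumes a Spark 2.0 SparkSession running locally; the object name and the output column names y_set and y_distinct are hypothetical, chosen only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, size}

// Hypothetical standalone app; in spark-shell only the body of main is needed.
object DistinctCountViaCollectSet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("distinct-count-via-collect-set")
      .master("local[*]") // assumption: local run for demonstration
      .getOrCreate()
    import spark.implicits._

    // Same data as in the example above.
    val df = Seq(("a", "a"), ("b", "c"), ("c", "a")).toDF("x", "y")

    // Aggregate once with collect_set (no DISTINCT aggregate involved),
    // then derive the distinct count from the size of the collected set.
    val grouped = df
      .groupBy($"x")
      .agg(collect_set($"y").as("y_set"))
      .withColumn("y_distinct", size($"y_set"))

    grouped.show()
    spark.stop()
  }
}
```

Because the count is derived after the aggregation, the plan contains no distinct aggregate alongside collect_set, which is the combination the error message complains about. This only helps when the distinct count and the set are taken over the same column.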

Thanks,
Lee
