spark-user mailing list archives

From Lee Becker <lee.bec...@hapara.com>
Subject Re: countDistinct, partial aggregates and Spark 2.0
Date Fri, 12 Aug 2016 18:14:21 GMT
On Fri, Aug 12, 2016 at 11:55 AM, Lee Becker <lee.becker@hapara.com> wrote:

> val df = sc.parallelize(Array(("a", "a"), ("b", "c"), ("c", "a"))).toDF("x", "y")
> val grouped = df.groupBy($"x").agg(countDistinct($"y"), collect_set($"y"))

This workaround executes with no exceptions:

val grouped = df.groupBy($"x").agg(size(collect_set($"y")), collect_set($"y"))

In this example countDistinct and collect_set run on the same column, so the
result of countDistinct is essentially redundant. Assuming they ran on
different columns (say there were a column 'z' as well), is there any
computational difference between countDistinct and size(collect_set(...))?
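For concreteness, here is a minimal sketch of the two-column case. The column 'z' and the sample data are hypothetical (only 'x' and 'y' appear in the thread), and this assumes a local Spark 2.0 session:

```scala
// Hypothetical example: countDistinct on 'y' vs. size(collect_set(...)) on 'z'.
// Column 'z' and the data below are illustrative, not from the original thread.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{countDistinct, collect_set, size}

object DistinctVsCollectSet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("distinct-vs-collect-set")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", "a", 1), ("b", "c", 2), ("c", "a", 1), ("c", "a", 3))
      .toDF("x", "y", "z")

    // collect_set must buffer every distinct value of 'z' per group before
    // size() reduces it to a count, whereas countDistinct only has to track
    // the count, so it can avoid materializing the full set per group.
    val grouped = df.groupBy($"x")
      .agg(countDistinct($"y"), size(collect_set($"z")))

    grouped.show()
    spark.stop()
  }
}
```

The observable results should match; the difference is in memory footprint and the aggregation plan Spark generates, which is presumably what the countDistinct/partial-aggregate distinction in this thread comes down to.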

-- 
*hapara* ● Making Learning Visible
1877 Broadway Street, Boulder, CO 80302
(Google Voice): +1 720 335 5332
www.hapara.com   Twitter: @hapara_team <http://twitter.com/hapara_team>
