spark-user mailing list archives

From Daniele Foroni <daniele.for...@gmail.com>
Subject [SparkSQL] Count Distinct issue
Date Fri, 14 Sep 2018 18:54:16 GMT
Hi all,

I am having trouble doing a count distinct over multiple columns.
Here is an example of my data:
+----+----+----+---+
|a   |b   |c   |d  |
+----+----+----+---+
|null|null|null|1  |
|null|null|null|2  |
|null|null|null|3  |
|null|null|null|4  |
|null|null|null|5  |
|null|null|null|6  |
|null|null|null|7  |
+----+----+----+---+
And my code:
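import org.apache.spark.sql.{Column, Dataset, Row}
import org.apache.spark.sql.functions.{col, countDistinct}

val df: Dataset[Row] = …
val cols: List[Column] = df.columns.map(col).toList  // one Column per column name
df.agg(countDistinct(cols.head, cols.tail: _*))      // count distinct over all columns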

In the example above, counting the distinct “rows” returns 7, as expected
(since the “d” column changes for every row).
However, with more columns (16) in EXACTLY the same situation (one incrementing column and
15 columns filled with nulls), the result is 0.
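
A minimal sketch for reproducing the two cases side by side, assuming a local SparkSession and integer-typed null columns (the original column types are not stated). The object name `CountDistinctRepro`, the helper names, and the column names `n1`, `n2`, ... are hypothetical. It prints the multi-column `countDistinct` result next to `df.distinct().count()`, which counts distinct rows and treats nulls as ordinary values. As far as I understand the Spark SQL function reference, `count(DISTINCT expr[, expr...])` only counts rows in which all expressions are non-null, which may be relevant here, but this sketch does not confirm that as the cause.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, countDistinct, lit}
import org.apache.spark.sql.types.IntegerType

object CountDistinctRepro {
  // Build 7 rows with `nullCols` all-null integer columns (hypothetical names n1, n2, ...)
  // followed by an incrementing column "d" with values 1 to 7.
  def build(spark: SparkSession, nullCols: Int): DataFrame = {
    val base = spark.range(1, 8).select(col("id").cast(IntegerType).as("d"))
    val nulls = (1 to nullCols).map(i => lit(null).cast(IntegerType).as(s"n$i"))
    base.select((nulls :+ col("d")): _*)
  }

  // Same aggregation as in the message: countDistinct over every column.
  def multiColumnCountDistinct(df: DataFrame): Long = {
    val cols: List[Column] = df.columns.map(col).toList
    df.agg(countDistinct(cols.head, cols.tail: _*)).head().getLong(0)
  }

  // Row-level alternative: deduplicate whole rows, then count them.
  def distinctRowCount(df: DataFrame): Long = df.distinct().count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("count-distinct-repro")
      .getOrCreate()
    try {
      // 3 null columns matches the small example, 15 matches the failing case.
      for (n <- Seq(3, 15)) {
        val df = build(spark, n)
        println(s"nullCols=$n countDistinct=${multiColumnCountDistinct(df)} " +
          s"distinctRows=${distinctRowCount(df)}")
      }
    } finally spark.stop()
  }
}
```

If the goal is simply the number of distinct rows, `distinctRowCount` may be enough on its own; whether it matches the intended semantics depends on how null values should be treated.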

I don’t understand why this happens.
Is there a solution or workaround?

Thanks,
---
Daniele

