spark-dev mailing list archives

From: Reynold Xin
Subject: multiple count distinct in SQL/DataFrame?
Date: Wed, 07 Oct 2015 00:51:17 GMT

The current implementation of multiple count distinct in a single query
performs poorly and is not robust, and it is hard to guarantee its
correctness through some of the refactorings for Tungsten. Supporting a
better version is possible in the future, but it will take a lot of
engineering effort. Most other Hadoop-based SQL systems (e.g. Hive, Impala)
don't support this feature.

As a result, we are considering removing support for multiple count
distinct in a single query in the next Spark release (1.6). If you use this
feature, please reply to this email. Thanks.

Note that if you don't care about null values, it is relatively easy to
rewrite such a query as a join of single-distinct aggregations, one per
distinct column.
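
As a rough illustration of the join rewrite mentioned above, here is a minimal sketch using Python's built-in sqlite3 module rather than Spark, so it runs anywhere. The table `events` and the columns `k`, `a`, `b` are made up for the example. It also shows one plausible reading of the null caveat: an equi-join on the grouping key never matches NULL to NULL, so the group with a null key drops out of the rewritten query.

```python
import sqlite3

# Hypothetical table: grouping key k, two columns a and b to count distinctly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (k TEXT, a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("x", 1, 10), ("x", 1, 20), ("x", 2, 20), ("y", 3, 30), (None, 4, 40)],
)

# Original form: two count distincts in a single query.
single = conn.execute(
    "SELECT k, COUNT(DISTINCT a), COUNT(DISTINCT b) "
    "FROM events GROUP BY k ORDER BY k"
).fetchall()

# Rewritten form: one single-distinct aggregation per column, joined on the key.
joined = conn.execute(
    """
    SELECT da.k, da.cnt_a, db.cnt_b
    FROM (SELECT k, COUNT(DISTINCT a) AS cnt_a FROM events GROUP BY k) da
    JOIN (SELECT k, COUNT(DISTINCT b) AS cnt_b FROM events GROUP BY k) db
      ON da.k = db.k
    ORDER BY da.k
    """
).fetchall()

print(single)  # includes a row for the NULL key
print(joined)  # same counts for non-null keys; the NULL-key group is gone
```

For non-null keys the two queries agree. If null keys matter, the join condition would need a null-safe comparison, which is where the rewrite stops being trivial.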
