spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Reynold Xin <r...@databricks.com>
Subject Re: multiple count distinct in SQL/DataFrame?
Date Wed, 07 Oct 2015 00:54:01 GMT
To provide more context, if we do remove this feature, the following SQL
query would throw an AnalysisException:

select count(distinct colA), count(distinct colB) from foo;

The following should still work:

select count(distinct colA) from foo;

The following should also work:

select count(distinct colA, colB) from foo;


On Tue, Oct 6, 2015 at 5:51 PM, Reynold Xin <rxin@databricks.com> wrote:

> The current implementation of multiple count distinct in a single query is
> very inferior in terms of performance and robustness, and it is also hard
> to guarantee correctness of the implementation in some of the refactorings
> for Tungsten. Supporting a better version of it is possible in the future,
> but will take a lot of engineering efforts. Most other Hadoop-based SQL
> systems (e.g. Hive, Impala) don't support this feature.
>
> As a result, we are considering removing support for multiple count
> distinct in a single query in the next Spark release (1.6). If you use this
> feature, please reply to this email. Thanks.
>
> Note that if you don't care about null values, it is relatively easy to
> reconstruct a query using joins to support multiple distincts.
>
>
>

Mime
View raw message