calcite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jesus Camacho Rodriguez <jcamachorodrig...@hortonworks.com>
Subject Changes for Grouping Sets in Calcite
Date Tue, 08 Dec 2015 13:50:18 GMT
Hi,

I wanted to open a thread with some conversation about changes in Grouping Sets implementation
in Calcite.

Grouping sets are currently implemented in Calcite using a bit to indicate each of the grouping
columns. For instance, consider the following group by clause:

...
GROUP BY CUBE (a, b)
...

The generated Aggregate operator in Calcite will have a row schema consisting of [a, b, GROUPING(a),
GROUPING(b)], where GROUPING(x) represents whether x is participating in the group by clause.
In particular, GROUPING returns 1 for aggregated or 0 for not aggregated in the result set.

In contrast, Hive's implementation stores a single number corresponding to the GROUPING bit
vector associated with a row (this is the result of the GROUPING_ID function in RDBMS such
as MSSQLServer, Oracle, etc). Thus, the row schema of the Aggregate operator is [a, b, GROUPING_ID(a,b)].

This difference is creating a mismatch between Calcite and Hive. Till now, we have solved
it in the Hive side: we created our own GROUPING_ID function applied over those columns. However,
we have some issues related to predicates pushdown, constant propagation, etc., that we need
to continue solving as e.g. new rules are added to our optimizer. In short, this is making
the code on the Hive side harder and harder to maintain.

We were wondering if it would make sense to change the implementation on the Calcite side.
Which systems are using Calcite's grouping sets currently? Are you in favor/against this change?
Why?

Thanks,
Jesús

Mime
View raw message