Hey all,
I observed an issue while working with Window Functions. I observed a case
where wrong results are returned from Drill.
In-case of weak schema such as parquet, Drill does not validate column
names. It is understandable when only part of the projection list in a
query. But when part of a Window Function, the results displayed are wrong,
and at times hard to identify the cause.
Two examples below:
> SELECT PERCENT_RANK() OVER (PARTITION BY s.store_sk, s.ss_customer_sk
ORDER BY s.store_sk, s.ss_customer_sk) FROM store_sales s LIMIT 2;
+---------+
| EXPR$0 |
+---------+
| 0.0 |
| 0.0 |
+---------+
2 rows selected (7.116 seconds)
SELECT CUME_DIST() OVER (PARTITION BY s.ss_store_sk ORDER BY s.ss_stoe_sk,
s.s_customr_sk) FROM store_sales s LIMIT 2;
+---------+
| EXPR$0 |
+---------+
| 1.0 |
| 1.0 |
+---------+
2 rows selected (8.361 seconds)
In both cases above, some columns do not exist.
With normal aggregate functions, it is similar to having a non-existent
column in projection list. Drill prints a column of null rows. This could
still be documented for users to expect "null" columns in results when
non-existent columns are part of a projection list.
> SELECT s.ss_store_sk, avg (ssdfd), ssdfd FROM store_sales s GROUP BY
s.ss_store_sk, ssdfd LIMIT 2;
+--------------+---------+--------+
| ss_store_sk | EXPR$1 | ssdfd |
+--------------+---------+--------+
| 10 | null | null |
| 4 | null | null |
+--------------+---------+--------+
2 rows selected (1.252 seconds)
But in case of window functions (and maybe other functions & expressions),
the results might look more real and hence difficult to identify that the
query had typos. Worse, users may trust the data returned from Drill, which
they shouldn't have.
Postgres:
# SELECT CUME_DIST() OVER (PARTITION BY s.ss_store_sk ORDER BY
s.ss_store_sk, s.ss_customer_sk) FROM store_sales s LIMIT 2;
cume_dist
----------------------
3.06415464350749e-05
3.06415464350749e-05
(2 rows)
# SELECT PERCENT_RANK() OVER (PARTITION BY s.store_sk, s.ss_customer_sk
ORDER BY s.store_sk, s.ss_customer_sk) FROM store_sales s LIMIT 2;
ERROR: column s.store_sk does not exist
LINE 1: ...ARTITION BY s.store_sk, s.ss_customer_sk ORDER BY s.store_sk...
^
I think Drill at minimum should throw a warning message when it encounters
a non-existent column. And ideally queries must fail when non-existent
columns are part of any function/expression.
I'll file a JIRA if it is agreed to be an issue.
Regards,
Abhishek
|