calcite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gian Merlino <g...@imply.io>
Subject Re: DISTINCT ON clause
Date Wed, 07 Mar 2018 23:50:48 GMT
In Druid I was hoping to eventually deal with this using FIRST and LAST
aggregators. So the original query would be:

  SELECT a, FIRST(b), FIRST(c) FROM T GROUP BY a

However, we have only implemented FIRST and LAST for numeric types, not for
strings or arrays yet. And we also haven't wired it into Druid SQL yet
(they are available in Druid's native query language only). Also in the
Druid the FIRST/LAST aggregators pick by first/last values by timestamp,
not by the ORDER BY clause of the query in general.

The idea would be you use them either when you really do want the
first/last value to be seen (by time), or when you just want _any_ value
and you don't care which one.

Gian

On Wed, Mar 7, 2018 at 6:16 PM, Julian Hyde <jhyde@apache.org> wrote:

> I always wished there was a variant of LIMIT that operated per key. E.g.
> return the youngest 2 employees in each department:
>
> SELECT *
> FROM Amp
> ORDER BY deptno LIMIT 2, birthdate DESC
>
> (This is not standard syntax.)
>
> DISTINCT ON is the special case LIMIT 1. But LIMIT 1 would not mess with
> SQL evaluation order the way DISTINCT ON does.
>
> Also, per-key limit is something that a sort-limit physical operator could
> usefully and efficiently do.
>
> > On Mar 4, 2018, at 8:09 PM, Aman Sinha <amansinha@apache.org> wrote:
> >
> > So far, I haven't had much success.. other databases such as Oracle, SQL
> > Server and DB2 don't seem to support this at least based on their
> published
> > SQL reference guides.
> >
> > One way people work around this is to use ROW_NUMBER() window function to
> > achieve similar results.
> > e.g
> >     SELECT a, b, c FROM (SELECT a, b, c , ROW_NUMBER() OVER (PARTITION BY
> > a ORDER BY <some column>) as rownum FROM t) WHERE rownum = 1;
> >
> > Regarding the aggregate functions allowing arbitrary values in a group, I
> > am not aware of databases that do that.  I am not sure how to verify this
> > though.
> >
> > -Aman
> >
> > On Thu, Feb 22, 2018 at 10:06 PM, Aman Sinha <amansinha@apache.org>
> wrote:
> >
> >> I will try to find out more about this during the next few days....what
> do
> >> Oracle, SQL Server, DB2 support.
> >>
> >> On Wed, Feb 21, 2018 at 2:12 PM, Julian Hyde <jhyde@apache.org> wrote:
> >>
> >>> Can you do some research, and see if any other databases do anything
> >>> similar? Since Postgres isn’t standard, maybe we can improve upon it a
> bit.
> >>>
> >>> Also, do any databases have an aggregate function that takes an
> arbitrary
> >>> value in the group, or asserts that all values are the same? We would
> use
> >>> that when we translate the syntactic sugar to algebra.
> >>>
> >>> Julian
> >>>
> >>>
> >>>> On Feb 21, 2018, at 2:02 PM, Aman Sinha <amansinha@apache.org>
wrote:
> >>>>
> >>>> It is a useful functionality, especially since the SELECT list can
> >>> contain
> >>>> mix of primitive types and array or map type columns.
> >>>> I do see your point about the Postgres semantics of ORDER BY vs
> DISTINCT
> >>>> ON.
> >>>> I don't like the hidden semantics of MySQL either.
> >>>> Perhaps the compromise solution is to introduce functions such as
> >>>> FIRST_ROW() aggregation function (similar to lead/lag window
> functions)
> >>> and
> >>>> use GROUP BY.
> >>>>
> >>>> -Aman
> >>>>
> >>>> On Wed, Feb 21, 2018 at 1:31 PM, Julian Hyde <jhyde@apache.org>
> wrote:
> >>>>
> >>>>> I can see that it would be useful. But there are a couple of things
> >>> about
> >>>>> it that are messy.
> >>>>>
> >>>>> I think the syntax is a bit clunky because it uses parentheses;
a
> query
> >>>>> would be difficult to read if people would like multiple columns,
> >>>>> expressions, and aliases.
> >>>>>
> >>>>> I also think the semantics are messy. If you read
> >>>>> https://www.postgresql.org/docs/9.5/static/sql-select.html#
> >>> SQL-DISTINCT <
> >>>>> https://www.postgresql.org/docs/9.5/static/sql-select.html#
> >>> SQL-DISTINCT>
> >>>>> you will see that “DISTINCT ON” is evaluated after the ORDER
BY
> clause,
> >>>>> whereas regular DISTINCT is evaluated before the ORDER BY clause.
So
> >>> there
> >>>>> will be a bizarre interaction if DISTINCT ON is used with UNION
and
> >>> ORDER
> >>>>> BY.
> >>>>>
> >>>>> By the way, good ol’ MySQL doesn’t have this problem; you can
just
> >>> write
> >>>>>
> >>>>> SELECT a, b, c FROM t GROUP BY a
> >>>>>
> >>>>> and it just picks the first value of b and c. This “feature”
is one
> of
> >>> my
> >>>>> least favorite things about MySQL, so let’s not emulate it. :)
> >>>>>
> >>>>> Julian
> >>>>>
> >>>>>
> >>>>>> On Feb 21, 2018, at 12:44 PM, Aman Sinha <amansinha@apache.org>
> >>> wrote:
> >>>>>>
> >>>>>> The DISTINCT 'ON' clause is not supported but I am trying to
see if
> >>> there
> >>>>>> are other people who have run into this.  One of the use cases
I
> have
> >>>>>> intuitively maps to something like this:
> >>>>>>
> >>>>>> SELECT DISTINCT *ON (a)*,  b, c  FROM T
> >>>>>>
> >>>>>> Here  suppose 'a' is an INT  and b, c are some complex types
such as
> >>>>>> array.  In my example, the values of  b and c happen to be the
same
> in
> >>>>> all
> >>>>>> rows belonging to the same group of 'a', so I just want the
first
> row.
> >>>>>> Since these are arrays, I cannot use MIN(b) GROUP BY a .   Other
> >>>>>> alternative is to create a new aggregate function that picks
the
> first
> >>>>>> value but it makes the syntax verbose especially with large
number
> of
> >>>>> such
> >>>>>> columns.
> >>>>>>
> >>>>>> Incidentally, Postgres supports this [1]
> >>>>>>
> >>>>>> [1]
> >>>>>> https://www.postgresql.org/docs/current/static/sql-
> >>>>> select.html#SQL-DISTINCT
> >>>>>>
> >>>>>>
> >>>>>> -Aman
> >>>>>
> >>>>>
> >>>
> >>>
> >>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message