calcite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gian Merlino <g...@imply.io>
Subject Re: Druid multi-value columns + Calcite
Date Thu, 16 Feb 2017 22:27:45 GMT
Playing with Postgres a bit it looks like its array types can behave in
Druid-y ways. The three queries below are what Druid would do with "select"
and "groupBy" queries against a multi-value column. I'm not sure how
standard this behavior is or if Calcite does the same thing, but this seems
like a promising tree to bark up so far.

gian=# \d tbl
       Table "public.tbl"
 Column |   Type    | Modifiers
--------+-----------+-----------
 x      | integer   |
 y      | integer   |
 zs     | integer[] |

gian=# select * from tbl;
 x |  y  |   zs
---+-----+---------
 1 | 100 | {1,2,3}
 2 | 200 | {1,3}
 3 | 300 | {2,3}
 4 | 400 | {1,3}
(4 rows)

gian=# select unnest(zs), count(*) from tbl where zs @> ARRAY[2] group by
unnest(zs);
 unnest | count
--------+-------
      1 |     1
      2 |     2
      3 |     2
(3 rows)

gian=# select z, count(*) from tbl cross join unnest(tbl.zs) as z where zs
@> ARRAY[2] group by z order by z;
 z | count
---+-------
 1 |     1
 2 |     2
 3 |     2
(3 rows)

Gian

On Wed, Feb 15, 2017 at 11:05 PM, Julian Hyde <jhyde.apache@gmail.com>
wrote:

> Standard SQL would not implicitly multiply rows. When performing GROUP BY
> x, and if x is a multiset, it would combine rows whose values have the same
> multiset: thus MULTISET [1, 3] = MULTISET [1, 3] but not MULTISET [1, 3, 3].
>
> You can use the “UNNEST” operator to unpack rows. A common thing to do is
> to join a row to one of its multiset columns. If we have table t (x int, y
> int multiset) then we could write
>
> select x, sum(y)
> from t cross join unnest t.y
> group by x
>
> Drill and Postgres allow unnest in the SELECT clause (rather than a table
> operator in the FROM clause) but that feels wrong to me: what’s in the
> SELECT clause should not affect the number of rows.
>
> Julian
>
>
>
> > On Feb 15, 2017, at 10:52 PM, Gian Merlino <gian@imply.io> wrote:
> >
> > ARRAY doesn't sound right:
> >
> >> Arrays can be compared for exact match with = or <>:
> >> No other inequality operator is supported, and usage of arrays in just
> > about any other context (including GROUP BY) is illegal.
> >
> > That doesn't line up well with how multi-value columns behave in Druid.
> >
> > Multiset sounds more promising. "'x' MEMBER OF a_multiset" looks like it
> > matches how {"type":"selector","dimension":"a_mv_column","value":"x"}
> would
> > behave in Druid -- our standard column-is-value filter. I don't see
> > something on that page about what happens if you GROUP BY a multiset,
> > though, or if any operation on multisets in SQL behaves like grouping on
> > multi-value columns does in Druid. Do you happen to know off the top of
> > your head how that works? If not I'll try to find some docs somewhere.
> >
> > Gian
> >
> > On Wed, Feb 15, 2017 at 12:10 PM, Julian Hyde <jhyde@apache.org> wrote:
> >
> >> How about modeling them as a collection (ARRAY or MULTISET)? This
> document
> >> describes them pretty well: http://farrago.sourceforge.
> >> net/design/CollectionTypes.html <http://farrago.sourceforge.
> >> net/design/CollectionTypes.html>
> >>
> >>
> >>> On Feb 14, 2017, at 5:08 PM, Gian Merlino <gian@imply.io> wrote:
> >>>
> >>> Hey Calcites,
> >>>
> >>> I'm hoping for some feedback on how to best handle Druid multi-value
> >>> columns in SQL.
> >>>
> >>> Background: Druid has a "multi-value column" feature that lets you have
> >>> string columns with more than one value per row. Details and examples
> of
> >>> how this works are here:
> >>> http://druid.io/docs/latest/querying/multi-value-dimensions.html. The
> >> short
> >>> version is that filtering on values works under the rule that "rows
> >> match a
> >>> filter if any value in a multi-value dimension matches your predicate".
> >>> Grouping works by sort of causing an explosion into multiple result
> rows,
> >>> similar to what Pig does when you flatten a bag. Selecting without
> >> grouping
> >>> doesn't do the exploding thing; instead it gives you the array of
> values.
> >>>
> >>> These behaviors are intended to make multi-value columns work well to
> >> hold
> >>> data like "tags" or "keywords" where you might want to ask questions
> >> like:
> >>> "how many rows have the tag 't1'" or "count the number of distinct
> users
> >>> for each tag".
> >>>
> >>> The current Calcite-based Druid SQL stuff doesn't handle this in any
> way
> >>> that really makes sense. The biggest issue is the expression
> simplifier,
> >>> which would incorrect simplify "tags = 't1' AND tags = 't2'" to
> "false".
> >>> But, it's possible for a row to match that if "tags" is multi-value.
> >>> Another issue is that the type is reported as a simple "varchar" and
> >> there
> >>> is no indication that multiple values are possible.
> >>>
> >>> I'm wondering what _would_ make the most sense in the SQL framework.
> The
> >>> simplest thing is to keep reporting it as "varchar", adjust the
> >> expression
> >>> simplifying rules to be aware of the fact that some optimizations
> >> shouldn't
> >>> be applied to multi-value columns, and leave it at that. The behavior
> >>> wouldn't be quite what you would expect for a varchar type but it
> should
> >>> "work" in a sense.
> >>>
> >>> Or we could report a different type than "varchar" and maybe do some
> >> other
> >>> things differently too?
> >>>
> >>> Gian
> >>
> >>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message