calcite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gian Merlino <g...@imply.io>
Subject Re: Druid multi-value columns + Calcite
Date Thu, 16 Feb 2017 06:52:47 GMT
ARRAY doesn't sound right:

> Arrays can be compared for exact match with = or <>:
> No other inequality operator is supported, and usage of arrays in just
about any other context (including GROUP BY) is illegal.

That doesn't line up well with how multi-value columns behave in Druid.

Multiset sounds more promising. "'x' MEMBER OF a_multiset" looks like it
matches how {"type":"selector","dimension":"a_mv_column","value":"x"} would
behave in Druid -- our standard column-is-value filter. I don't see
something on that page about what happens if you GROUP BY a multiset,
though, or if any operation on multisets in SQL behaves like grouping on
multi-value columns does in Druid. Do you happen to know off the top of
your head how that works? If not I'll try to find some docs somewhere.

Gian

On Wed, Feb 15, 2017 at 12:10 PM, Julian Hyde <jhyde@apache.org> wrote:

> How about modeling them as a collection (ARRAY or MULTISET)? This document
> describes them pretty well: http://farrago.sourceforge.
> net/design/CollectionTypes.html <http://farrago.sourceforge.
> net/design/CollectionTypes.html>
>
>
> > On Feb 14, 2017, at 5:08 PM, Gian Merlino <gian@imply.io> wrote:
> >
> > Hey Calcites,
> >
> > I'm hoping for some feedback on how to best handle Druid multi-value
> > columns in SQL.
> >
> > Background: Druid has a "multi-value column" feature that lets you have
> > string columns with more than one value per row. Details and examples of
> > how this works are here:
> > http://druid.io/docs/latest/querying/multi-value-dimensions.html. The
> short
> > version is that filtering on values works under the rule that "rows
> match a
> > filter if any value in a multi-value dimension matches your predicate".
> > Grouping works by sort of causing an explosion into multiple result rows,
> > similar to what Pig does when you flatten a bag. Selecting without
> grouping
> > doesn't do the exploding thing; instead it gives you the array of values.
> >
> > These behaviors are intended to make multi-value columns work well to
> hold
> > data like "tags" or "keywords" where you might want to ask questions
> like:
> > "how many rows have the tag 't1'" or "count the number of distinct users
> > for each tag".
> >
> > The current Calcite-based Druid SQL stuff doesn't handle this in any way
> > that really makes sense. The biggest issue is the expression simplifier,
> > which would incorrect simplify "tags = 't1' AND tags = 't2'" to "false".
> > But, it's possible for a row to match that if "tags" is multi-value.
> > Another issue is that the type is reported as a simple "varchar" and
> there
> > is no indication that multiple values are possible.
> >
> > I'm wondering what _would_ make the most sense in the SQL framework. The
> > simplest thing is to keep reporting it as "varchar", adjust the
> expression
> > simplifying rules to be aware of the fact that some optimizations
> shouldn't
> > be applied to multi-value columns, and leave it at that. The behavior
> > wouldn't be quite what you would expect for a varchar type but it should
> > "work" in a sense.
> >
> > Or we could report a different type than "varchar" and maybe do some
> other
> > things differently too?
> >
> > Gian
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message