calcite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Julian Hyde <jhyde.apa...@gmail.com>
Subject Re: Druid multi-value columns + Calcite
Date Thu, 16 Feb 2017 07:05:25 GMT
Standard SQL would not implicitly multiply rows. When performing GROUP BY x, and if x is a
multiset, it would combine rows whose values have the same multiset: thus MULTISET [1, 3]
= MULTISET [1, 3] but not MULTISET [1, 3, 3].

You can use the “UNNEST” operator to unpack rows. A common thing to do is to join a row
to one of its multiset columns. If we have table t (x int, y int multiset) then we could write

select x, sum(y)
from t cross join unnest t.y
group by x

Drill and Postgres allow unnest in the SELECT clause (rather than a table operator in the
FROM clause) but that feels wrong to me: what’s in the SELECT clause should not affect the
number of rows.

Julian



> On Feb 15, 2017, at 10:52 PM, Gian Merlino <gian@imply.io> wrote:
> 
> ARRAY doesn't sound right:
> 
>> Arrays can be compared for exact match with = or <>:
>> No other inequality operator is supported, and usage of arrays in just
> about any other context (including GROUP BY) is illegal.
> 
> That doesn't line up well with how multi-value columns behave in Druid.
> 
> Multiset sounds more promising. "'x' MEMBER OF a_multiset" looks like it
> matches how {"type":"selector","dimension":"a_mv_column","value":"x"} would
> behave in Druid -- our standard column-is-value filter. I don't see
> something on that page about what happens if you GROUP BY a multiset,
> though, or if any operation on multisets in SQL behaves like grouping on
> multi-value columns does in Druid. Do you happen to know off the top of
> your head how that works? If not I'll try to find some docs somewhere.
> 
> Gian
> 
> On Wed, Feb 15, 2017 at 12:10 PM, Julian Hyde <jhyde@apache.org> wrote:
> 
>> How about modeling them as a collection (ARRAY or MULTISET)? This document
>> describes them pretty well: http://farrago.sourceforge.
>> net/design/CollectionTypes.html <http://farrago.sourceforge.
>> net/design/CollectionTypes.html>
>> 
>> 
>>> On Feb 14, 2017, at 5:08 PM, Gian Merlino <gian@imply.io> wrote:
>>> 
>>> Hey Calcites,
>>> 
>>> I'm hoping for some feedback on how to best handle Druid multi-value
>>> columns in SQL.
>>> 
>>> Background: Druid has a "multi-value column" feature that lets you have
>>> string columns with more than one value per row. Details and examples of
>>> how this works are here:
>>> http://druid.io/docs/latest/querying/multi-value-dimensions.html. The
>> short
>>> version is that filtering on values works under the rule that "rows
>> match a
>>> filter if any value in a multi-value dimension matches your predicate".
>>> Grouping works by sort of causing an explosion into multiple result rows,
>>> similar to what Pig does when you flatten a bag. Selecting without
>> grouping
>>> doesn't do the exploding thing; instead it gives you the array of values.
>>> 
>>> These behaviors are intended to make multi-value columns work well to
>> hold
>>> data like "tags" or "keywords" where you might want to ask questions
>> like:
>>> "how many rows have the tag 't1'" or "count the number of distinct users
>>> for each tag".
>>> 
>>> The current Calcite-based Druid SQL stuff doesn't handle this in any way
>>> that really makes sense. The biggest issue is the expression simplifier,
>>> which would incorrect simplify "tags = 't1' AND tags = 't2'" to "false".
>>> But, it's possible for a row to match that if "tags" is multi-value.
>>> Another issue is that the type is reported as a simple "varchar" and
>> there
>>> is no indication that multiple values are possible.
>>> 
>>> I'm wondering what _would_ make the most sense in the SQL framework. The
>>> simplest thing is to keep reporting it as "varchar", adjust the
>> expression
>>> simplifying rules to be aware of the fact that some optimizations
>> shouldn't
>>> be applied to multi-value columns, and leave it at that. The behavior
>>> wouldn't be quite what you would expect for a varchar type but it should
>>> "work" in a sense.
>>> 
>>> Or we could report a different type than "varchar" and maybe do some
>> other
>>> things differently too?
>>> 
>>> Gian
>> 
>> 


Mime
View raw message