calcite-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Julian Hyde <jh...@apache.org>
Subject Re: Druid multi-value columns + Calcite
Date Wed, 15 Feb 2017 20:10:08 GMT
How about modeling them as a collection (ARRAY or MULTISET)? This document describes them pretty
well: http://farrago.sourceforge.net/design/CollectionTypes.html <http://farrago.sourceforge.net/design/CollectionTypes.html>


> On Feb 14, 2017, at 5:08 PM, Gian Merlino <gian@imply.io> wrote:
> 
> Hey Calcites,
> 
> I'm hoping for some feedback on how to best handle Druid multi-value
> columns in SQL.
> 
> Background: Druid has a "multi-value column" feature that lets you have
> string columns with more than one value per row. Details and examples of
> how this works are here:
> http://druid.io/docs/latest/querying/multi-value-dimensions.html. The short
> version is that filtering on values works under the rule that "rows match a
> filter if any value in a multi-value dimension matches your predicate".
> Grouping works by sort of causing an explosion into multiple result rows,
> similar to what Pig does when you flatten a bag. Selecting without grouping
> doesn't do the exploding thing; instead it gives you the array of values.
> 
> These behaviors are intended to make multi-value columns work well to hold
> data like "tags" or "keywords" where you might want to ask questions like:
> "how many rows have the tag 't1'" or "count the number of distinct users
> for each tag".
> 
> The current Calcite-based Druid SQL stuff doesn't handle this in any way
> that really makes sense. The biggest issue is the expression simplifier,
> which would incorrect simplify "tags = 't1' AND tags = 't2'" to "false".
> But, it's possible for a row to match that if "tags" is multi-value.
> Another issue is that the type is reported as a simple "varchar" and there
> is no indication that multiple values are possible.
> 
> I'm wondering what _would_ make the most sense in the SQL framework. The
> simplest thing is to keep reporting it as "varchar", adjust the expression
> simplifying rules to be aware of the fact that some optimizations shouldn't
> be applied to multi-value columns, and leave it at that. The behavior
> wouldn't be quite what you would expect for a varchar type but it should
> "work" in a sense.
> 
> Or we could report a different type than "varchar" and maybe do some other
> things differently too?
> 
> Gian


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message