drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Julian Hyde <julianh...@gmail.com>
Subject Re: updates to logical plan spec
Date Sat, 13 Oct 2012 09:03:00 GMT
I guess I've been thinking about something a little different. I've been thinking about what
would be an appropriate algebra to internally represent, manipulate, and optimize Drill queries.


My conclusion is that the best candidate is relational algebra, augmented with data types
that allow collections of nested records and with "explode" and "implode" operators. ("explode"
takes a record with nested records and converts it to a sequence of flat records, plus a "location"
value that indicates how the nested record fits into the parent; "implode" is the inverse:
it takes a sequence of flat records with a "location" field and converts into a nested record.)
Here is how I came to that conclusion.

The algebra is lower level than DrQL: viz, it would have fewer operators, and the operators
would be easier to reason about and to write transformation rules for. (Anyone who thinks
that DrQL's 'where' operator is straightforward should ponder why BigQuery will sometimes
give the error "Cannot query the cross product of repeated fields".)

The algebra is (probably) higher level than what Ted calls a "logical plan". His operators
produce two outputs, and while that makes perfect sense for physical operators, it is difficult
to reason about.

Here are a few reasons why I consider DrQL to be a less clean model than the relational model.
As I've said before, the "where" operator has a much more complex behavior than SQL's where.
It is best understood as decomposing records, applying a filtering predicate, then re-composing
the fragments of the row that made it through the filter. The "within" clause is a nice shorthand,
but is too limited to be considered a full operator. Trees (collections within rows) are similar
to relations (collections of rows) but are handled using different operators. If the "tree"
model was as powerful as advertised then we wouldn't need the concept of "relation" at all.

That doesn't mean that DrQL is not a good query language. It seems to be concise, and users
learn it quickly and like it. Syntactic sugar operators like "within" is totally appropriate
(just like syntactic sugar "select distinct" and "having" in SQL).

To fix up DrQL's "where" operator, we convert it to "explode" followed by "filter" followed
by "implode". To fix up aggregate "within", we apply "explode" then aggregate. We find that
we never need to operate on trees. If we need to operate on a tree, we explode into several
records, apply relational operators on those records, and if necessary implode back again.
We're operating in relational algebra.

This is good news, because relational algebra is well behaved and well understood.

And by the way, even if the algebra is about exploded sets of flat records, there's no reason
that the physical operators can't operate on tree-structured records. We could recognize explode-followed-by-filter-followed-by-implode
and implement an operator that does precisely the same as a DrQL "where" clause.

Am I over-engineering here? It's possible. Maybe Drill doesn't need query optimization. Maybe
queries can go straight from a DrQL parse tree to a DAG of operators using a straightforward
mapping. But I'll argue that many people will come to Drill with SQL queries, or queries very
similar to SQL, data sets with minimal nesting, and will be saddened when Drill can't execute
their queries. This particular user kicked the tires, was impressed with the speed of the
car, but was disappointed that he couldn't drive it where he wanted to go: http://cwebbbi.wordpress.com/2012/05/20/a-look-at-google-bigquery/.

Julian

On Oct 12, 2012, at 7:37 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> I talked to Jason some more.  He had some very good suggestions.
> 
> a) some operators need to have multiple outputs.  For instance, the group
> operator needs to output the main data stream and a reference to the
> grouped field
> 
> b) what Julian was calling nest/unnest is more naturally called explode and
> flatten.  The idea is that some field has a list-like value and the output
> will be each of those values.  Actually, there are two outputs.  One is the
> original input and the second is the explode sequence.  This can be the
> input to a DAG which does whatever we want to that exploded sequence,
> typically aggregating it, but really doing whatever we want.  Then the
> flatten operator handles splicing the output of the sub-DAG into the
> original record that had the list-like value.  There are two outputs of the
> flatten operator as well, which are the main data flow and a reference to
> the output of the DAG in the main data.
> 
> This style handles all of the normal grouping/aggregating type of things we
> want to do and it also handles all of Dremel's within syntax.
> 
> I also realized a few things as well
> 
> 1) the bind needs to be rooted in some data source so that we can
> understand scoping relative to schemas
> 
> 2) there is an important difference between two separate outputs of a DAG
> element and a single output that goes two places.
> 
> 3) everywhere I was wanting to inject an output field name can be handled
> by multiple outputs
> 
> 
> 
> I think that the logical plan spec is ready for two things, both of which
> can be done by somebody other than me:
> 
> 
> A - We can now start trying to convert an abstract syntax tree from
> Dremel-ish source into a logical plan
> 
> B - We can implement a toy interpreter for the logical plan that transforms
> sequences of trees.


Mime
View raw message