On Sat, Oct 13, 2012 at 2:03 AM, Julian Hyde <julianhyde@gmail.com> wrote:
> I guess I've been thinking about something a little different. I've been
> thinking about what would be an appropriate algebra to internally
> represent, manipulate, and optimize Drill queries.
>
I don't think that we are far apart.
> My conclusion is that the best candidate is relational algebra, augmented
> with data types that allow collections of nested records and with "explode"
> and "implode" operators. ("explode" takes a record with nested records and
> converts it to a sequence of flat records, plus a "location" value that
> indicates how the nested record fits into the parent; "implode" is the
> inverse: it takes a sequence of flat records with a "location" field and
> converts into a nested record.)
This is very close to what I now have in the plan document.
> ...
> The algebra is (probably) higher level than what Ted calls a "logical
> plan". His operators produce two outputs, and while that makes perfect
> sense for physical operators, it is difficult to reason about.
>
The multiple outputs fall into two cases:
1) a secondary output that is simply a reference to a new value that
doesn't yet have a name.
2) explode provides a reference to the original data stream. If you think
that flatten (aka implode) doesn't need that, then this isn't necessary.
I don't think that cases of (1) affect reasoning about the DAG and I think
that the unique case of (2) may be something we can eliminate.
To fix up DrQL's "where" operator, we convert it to "explode" followed by
> "filter" followed by "implode". To fix up aggregate "within", we apply
> "explode" then aggregate. We find that we never need to operate on trees.
> If we need to operate on a tree, we explode into several records, apply
> relational operators on those records, and if necessary implode back again.
> We're operating in relational algebra.
>
I think that this works with the current plan document.
> Am I over-engineering here? It's possible. Maybe Drill doesn't need query
> optimization. Maybe queries can go straight from a DrQL parse tree to a DAG
> of operators using a straightforward mapping.
If so, the query optimizer will be very simple. But I am pretty sure that
getting delayed record assembly to work right will require reasoning such
as you suggest.
|