drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Julian Hyde <julianh...@gmail.com>
Subject Next steps, "hello world" queries, and rants about query languages
Date Wed, 19 Sep 2012 20:33:37 GMT
On Sep 18, 2012, at 4:09 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> Excellent.
> I think that the key next steps are:
> 1) codify some API's (Jason and Julian are on this)
> 2) build out the "hello world" app using the available parser and Optiq and
> a trivial data model (#1 is not a blocker for this)
> 3) evaluate where we are implications are relative to nested data and
> columnar formats

It would be really useful to establish a list of queries. Minimal, but demonstrate the key
features (e.g. "COUNT(a.b) WITHIN RECORD", the ability to aggregate over nested collections.

> Once we have that, we should be able to open up the parallelism of
> development massively.
> Some open questions/tasks that many people may be able to determine include:
> - is it possible to change Optiq operators and optimization rules without a
> recompile?

Yes. (From Optiq's perspective, anyway, it is very straightforward. Optiq starts from scratch
each time a query is prepared, and takes whatever metadata sources, operators and rules it
is given.)

> - what has to happen to make Optiq handle nested data?

I am still mulling the answer to that question. The list of queries I mentioned above will
help me understand what exactly is "nested data" and the operations on it.

I have been using http://developers.google.com/bigquery/docs/query-reference as my main resource
for BigQuery/DrQL. Let me know if there is a more authoritative source.

For the record (I'll only make this whine once, I promise) the BigQuery language has some
usability flaws, particularly inconsistencies with SQL. Some of these may have been deliberate
(for instance, to try to make the language more concise for the user to type), or may have
been made out of ignorance for SQL.

* The paper has an expression like "COUNT(x > 5)". SQL's COUNT operator counts not-NULL
values, not boolean TRUE values, so in SQL this expression would not have the desired effect

* My jaw dropped when I read "Unlike many other SQL-based systems, BigQuery uses the comma
syntax to indicate table unions, not joins". I see how commas are a convenient abbreviation
for UNION, but that's like redefining "if" in a programming language.

* "Note that the HAVING clause can only refer to fields defined in your SELECT clause (if
the field has an alias, you must use it; if it doesn't, use the aggregate field name instead)."
This kind of design decision makes the language difficult to generate for.

* Apparently string literals can be enclosed in either single or double quotes. This may be
for the convenience of the user, but is at odds with SQL, which allows only single quotes.
(Many SQL dialects use double quotes to quote identifiers that contain mixed case or spaces.)

* BigQuery allows table names to be prefixed with an optional "projectname:".

* The syntax for accessing nested collections at depth 2 or more. If "a" is a table alias,
and b is a collection-valued field, then "a.b" is the collection of sub-records, and "a.b.c"
is the union of the collections of the "c" field of all records in the "a.b" collection. In
'a.b.c', the first '.' is the conventional operator that accesses a field of a record. But
the second '.' is a strange beast that operates on a set of records.

Most of these I offer as evidence that DrQL is not a superset of SQL. I suppose we can vive
la difference, implement separate parsers/validators for the two languages, including implementing
DrQL "features" we think are ill-advised.

But I was looking to DrQL to find out how to query nested collections, and I didn't find much
depth. How, for instance, to convert a "dept" relation with nested "emp" records into a "flat"
relation? Or given an "shipment.order.lineitem" nested relation, sum up lineitem.discount
only in orders that have "order.prepaid=true"? Are nested collections ordered?

SQL's support for nested collections is powerful & consistent. (But not very concise,
or intuitive for the non-expert.) It has operators like UNNEST that convert a nested collection
into a relation (note that collections and relations are different beasts), and operators
such as CARDINALITY to aggregate them. (My friend John Sichi wrote a great overview here:

To be clear, I am not advocating adding SQL's collection constructs to Drill's query language.
But I am drawing inspiration from them when designing how Optiq would represent collections,
because a query planner needs semantics to be very, very precise. I can see how most of DrQL's
constructs would map onto SQL's constructs.

I'll have a better answer to the "what has to happen to make Optiq handle nested data" question
in a day or two.

View raw message