drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jason Frantz <jfra...@maprtech.com>
Subject Re: Stymied by queries
Date Mon, 27 Aug 2012 04:57:23 GMT
Hi Rob,

Looks like you guys have put a lot of thought and work into building
support for Dremel-style data packing. I'm in total agreement with you on
the requirements of decoding. I'm also in agreement about FLWOR - I've
personally never seen anything about XQuery that made me excited to program.

All that said, I don't quite understand the dilemma with your query.

SELECT DocId FROM data WHERE Name.Url = <whatever>

Why is filtering on Name.Url different than in SQL? SQL database already
use the same NULL rejection pattern you represented as a subquery. Should
equality be treated differently for collections (e.g. "contains") and
scalars? I don't see why this should be the case.

-Jason

On Sun, Aug 26, 2012 at 3:50 PM, Rob Grzywinski
<rob@aggregateknowledge.com>wrote:

> I wanted to share some of the internal work that we had done on Dremel +
> BigQuery-like queries to inject some momentum into the group.
>
> We finished a basic implementation of a field-stripe encoder and decoder
> and
> wanted to put a little effort into querying some data. (See
> https://github.com/rgrzywinski/field-stripe/ )
>
> We had the following requirements:
> * 1 pass over the data
> * SAX-style output (result set) record writer (implying that you don¹t
> cache
> results < you simply hand them off as you project / select)
> * support for select (filter), project and aggregate
>
> We quickly came to the realization that there is only one model that is
> supported under these requirements: sub-tree pruning. (You have to spend
> some time thinking about a schema tree and what happens if you have select
> and project in different branches of that tree. You realize that there are
> always cases in which you cannot evaluate your select before you have to
> project. This boils down to the fact that will hand off a result and then
> later realize that it should have been filtered. The only way that you can
> prevent this is by saying that selects only affect themselves and their
> children (i.e. ³sub-tree pruning²).) The problem with sub-tree pruning is
> that it¹s 100% counter to everything that you¹ve ever done with SQL. For
> example:
>
>     SELECT DocId
>       FROM data
>      WHERE Name.Url = ³whatever²
>
> (using the schema from the Dremel paper)
>
> That query would return *all records* but prune those sub-records that
> don¹t
> have Name.Url equal to ³whatever². What would be more natural (i.e. more
> SQL-like) would be to filter the records that don¹t have Name.Url. But then
> one immediately asks ³what about the sub-records²? Do you both filter the
> records and prune the sub-trees? And what does ³=² mean if Name.Url is a
> collection? Does it mean ³has only², ³contains², other?
>
> So how does Dremel solve all of this? They just do the sub-record/tree
> purning and then wrap the query within another query to get what they want.
> (And FYI < Dremel / BigQuery only support two table joins (where either or
> both table can be a sub-select) --
> https://developers.google.com/bigquery/docs/query-reference)
>
>     SELECT DocId
>       FROM (
>                SELECT DocId
>                  FROM data
>                WHERE Name.Url = ³whatever² )
>      WHERE Name.Url IS NOT NULL
>
> (Or something like that) You can always make a sequence of queries that
> gives you what you want.
>
> (I would love to see someone from the group validate or squash all of this
> theory and conjecture! I really want to be wrong here!)
>
> One really really wants to try this out in BigQuery. Unfortunately, in
> spite
> of promises, you still can¹t add hierarchical data to BigQuery. You can
> only
> use their pre-defined data sets which unfortunately doesn¹t have any
> interesting data in it to allow you to tease out answers. The moment that
> you use a flat structure then all of this hoopla goes away and you devolve
> to SQL (which is what it seems that the public version of BigQuery only
> supports at this moment).
>
> So we started thinking: what else does structured queries? XPath / XQuery.
> XPath really only gives you subtrees which isn¹t that interesting. XQuery
> immediately forces you into FLWOR. Having step 1 of writing a query
> processor be FLWOR is depressing.
>
> So we kept looking. What about JSON? There¹s a ton of dead projects for
> JSON
> queries :(  We looked at MongoDB¹s query interface.  We looked at
> CouchBase.
> Ah! UnQL ( http://www.unqlspec.org/display/UnQL/Home ) and a ton of press
> releases. But, oh wait, it seems to be dead (
> http://www.arangodb.org/2012/04/07/is_unql_dead ). Oh boy!. Then we find
> this AQL thing ( http://www.arangodb.org/manuals/Aql.html and more info at
> http://www.arangodb.org/2012/04/19/rfc-the-avocadodb-query-language ).
> Oddly, they had to resort to FLWOR-like structures too.
>
> (More to come ...)
>
>
> --
> Rob Grzywinski
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message