drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rob Grzywinski <...@aggregateknowledge.com>
Subject Re: Stymied by queries
Date Mon, 27 Aug 2012 13:57:12 GMT
You're highlighting a fundamental concern that I have about the queries as
stated in the Dremel paper. Section 5 in the Dremel paper says:

"Think of a nested record as a labeled tree, where each label corresponds to
a field name. The selection op-erator prunes away the branches of the tree
that do not satisfy the specified conditions. Thus, only those nested
records are retained where Name.Url is defined and starts with http."

Notice that it says "only those *nested* records are retained" and not "only
those records are retained".  (This also follows if you work through the
algorithm in Figure 19 from the Dremel paper (what we've called the "trickle
reassembly").)

Specifically, 

     SELECT DocId
       FROM data
      WHERE Name.Url = ³whatever²

will return *all* records but prune those nested records where Name.Url
doesn't equal "whatever". For example if you had the data (in JSON'y
format):

{ {"DocId":1, "Name":[{"Url":"A1"},{"Url","whatever"},{"Url":"C1"}]},
  {"DocId":2, "Name":[{"Url":"A2"}]} }

then the above query would return:

{ {"DocId":1, "Name":[{"Url","whatever"}]},
  {"DocId":2, "Name":null} }

To then get only those records that have Name.Url equal to "whatever" you
would need to wrap this with:

     SELECT DocId
       FROM ( results )
      WHERE Name.Url IS NOT NULL

which gives:

{ {"DocId":1, "Name":[{"Url","whatever"}]} }

The inner query that you presented would result in:

  SELECT DocId
    FROM data
   WHERE Name.Url IS NOT NULL

{ {"DocId":1, "Name":[{"Url":"A1"},{"Url","whatever"},{"Url":"C1"}]},
  {"DocId":2, "Name":[{"Url":"A2"}]} }

(Nothing is restricted) And then the outer query is the same as my first
query above:

     SELECT DocId
       FROM ( result )
      WHERE Name.Url = ³whatever²

{ {"DocId":1, "Name":[{"Url","whatever"}]},
  {"DocId":2, "Name":null} }

I want to emphasize again that I really want to be wrong about my reading
and interpretation of the Dremel paper.


--
Rob Grzywinski



On 8/26/12 9:46 PM, "Constantine Peresypkin" <pconstantine@gmail.com> wrote:

> Hello Rob,
> I think you've meant
> 
> SELECT DocId
>       FROM (
>                SELECT DocId
>                  FROM data
>                 WHERE Name.Url IS NOT NULL)
>       WHERE Name.Url = ³whatever²
> 
> in the example query.
> 
> In general you're right, you can only "optimize" queries that have nothing
> but "required" fields inside.
> But the whole point in Dremel columnar structure is: do a fast full-scan
> and then fast prune.
> That's why they have all those integer repetition/definition levels, they
> only read the levels, if possible, and make early "pruning" decision.
> But you can optimize the queries by other means, mainly by data
> "compression", in fact you just optimize full column scans.
> For example: do not store NULLs, use RLE encoding, dictionary encoding and
> "plain" compression.
> We even implemented some of these optimizations in OpenDremel.
> 
> On Mon, Aug 27, 2012 at 1:50 AM, Rob Grzywinski
> <rob@aggregateknowledge.com>wrote:
> 
>> I wanted to share some of the internal work that we had done on Dremel +
>> BigQuery-like queries to inject some momentum into the group.
>> 
>> We finished a basic implementation of a field-stripe encoder and decoder
>> and
>> wanted to put a little effort into querying some data. (See
>> https://github.com/rgrzywinski/field-stripe/ )
>> 
>> We had the following requirements:
>> * 1 pass over the data
>> * SAX-style output (result set) record writer (implying that you don¹t
>> cache
>> results < you simply hand them off as you project / select)
>> * support for select (filter), project and aggregate
>> 
>> We quickly came to the realization that there is only one model that is
>> supported under these requirements: sub-tree pruning. (You have to spend
>> some time thinking about a schema tree and what happens if you have select
>> and project in different branches of that tree. You realize that there are
>> always cases in which you cannot evaluate your select before you have to
>> project. This boils down to the fact that will hand off a result and then
>> later realize that it should have been filtered. The only way that you can
>> prevent this is by saying that selects only affect themselves and their
>> children (i.e. ³sub-tree pruning²).) The problem with sub-tree pruning is
>> that it¹s 100% counter to everything that you¹ve ever done with SQL. For
>> example:
>> 
>>     SELECT DocId
>>       FROM data
>>      WHERE Name.Url = ³whatever²
>> 
>> (using the schema from the Dremel paper)
>> 
>> That query would return *all records* but prune those sub-records that
>> don¹t
>> have Name.Url equal to ³whatever². What would be more natural (i.e. more
>> SQL-like) would be to filter the records that don¹t have Name.Url. But then
>> one immediately asks ³what about the sub-records²? Do you both filter the
>> records and prune the sub-trees? And what does ³=² mean if Name.Url is a
>> collection? Does it mean ³has only², ³contains², other?
>> 
>> So how does Dremel solve all of this? They just do the sub-record/tree
>> purning and then wrap the query within another query to get what they want.
>> (And FYI < Dremel / BigQuery only support two table joins (where either or
>> both table can be a sub-select) --
>> https://developers.google.com/bigquery/docs/query-reference)
>> 
>>     SELECT DocId
>>       FROM (
>>                SELECT DocId
>>                  FROM data
>>                WHERE Name.Url = ³whatever² )
>>      WHERE Name.Url IS NOT NULL
>> 
>> (Or something like that) You can always make a sequence of queries that
>> gives you what you want.
>> 
>> (I would love to see someone from the group validate or squash all of this
>> theory and conjecture! I really want to be wrong here!)
>> 
>> One really really wants to try this out in BigQuery. Unfortunately, in
>> spite
>> of promises, you still can¹t add hierarchical data to BigQuery. You can
>> only
>> use their pre-defined data sets which unfortunately doesn¹t have any
>> interesting data in it to allow you to tease out answers. The moment that
>> you use a flat structure then all of this hoopla goes away and you devolve
>> to SQL (which is what it seems that the public version of BigQuery only
>> supports at this moment).
>> 
>> So we started thinking: what else does structured queries? XPath / XQuery.
>> XPath really only gives you subtrees which isn¹t that interesting. XQuery
>> immediately forces you into FLWOR. Having step 1 of writing a query
>> processor be FLWOR is depressing.
>> 
>> So we kept looking. What about JSON? There¹s a ton of dead projects for
>> JSON
>> queries :(  We looked at MongoDB¹s query interface.  We looked at
>> CouchBase.
>> Ah! UnQL ( http://www.unqlspec.org/display/UnQL/Home ) and a ton of press
>> releases. But, oh wait, it seems to be dead (
>> http://www.arangodb.org/2012/04/07/is_unql_dead ). Oh boy!. Then we find
>> this AQL thing ( http://www.arangodb.org/manuals/Aql.html and more info at
>> http://www.arangodb.org/2012/04/19/rfc-the-avocadodb-query-language ).
>> Oddly, they had to resort to FLWOR-like structures too.
>> 
>> (More to come ...)
>> 
>> 
>> --
>> Rob Grzywinski
>> 
>> 
> 


Mime
View raw message