drill-dev mailing list archives

From Rob Grzywinski <...@aggregateknowledge.com>
Subject Re: Stymied by queries
Date Mon, 27 Aug 2012 14:56:57 GMT
(And now that I really look at this, perhaps my outer query should be:

   SELECT DocId
     FROM data
    WHERE Name IS NOT NULL

(rather than "Name.Url IS NOT NULL") so that the level of the filter and
project are the same.)

(And yes, I was being loose with the "SELECT" part for illustrative purposes
below. It should have been "SELECT DocId, Name.Url" to return the results
that I show.)

On 8/27/12 8:57 AM, "Rob Grzywinski" <rob@aggregateknowledge.com> wrote:

> You're highlighting a fundamental concern that I have about the queries as
> stated in the Dremel paper. Section 5 in the Dremel paper says:
> 
> "Think of a nested record as a labeled tree, where each label corresponds to a
> field name. The selection operator prunes away the branches of the tree that
> do not satisfy the specified conditions. Thus, only those nested records are
> retained where Name.Url is defined and starts with http."
> 
> Notice that it says "only those *nested* records are retained" and not "only
> those records are retained".  (This also follows if you work through the
> algorithm in Figure 19 from the Dremel paper (what we've called the "trickle
> reassembly").)
> 
> Specifically, 
> 
>      SELECT DocId
>        FROM data
>       WHERE Name.Url = "whatever"
> 
> will return *all* records but prune those nested records where Name.Url
> doesn't equal "whatever". For example if you had the data (in JSON'y format):
> 
> { {"DocId":1, "Name":[{"Url":"A1"},{"Url","whatever"},{"Url":"C1"}]},
>   {"DocId":2, "Name":[{"Url":"A2"}]} }
> 
> then the above query would return:
> 
> { {"DocId":1, "Name":[{"Url","whatever"}]},
>   {"DocId":2, "Name":null} }
> 
> To then get only those records that have Name.Url equal to "whatever" you
> would need to wrap this with:
> 
>      SELECT DocId
>        FROM ( results )
>       WHERE Name.Url IS NOT NULL
> 
> which gives:
> 
> { {"DocId":1, "Name":[{"Url","whatever"}]} }
> 
> The inner query that you presented would result in:
> 
>   SELECT DocId
>     FROM data
>    WHERE Name.Url IS NOT NULL
> 
> { {"DocId":1, "Name":[{"Url":"A1"},{"Url","whatever"},{"Url":"C1"}]},
>   {"DocId":2, "Name":[{"Url":"A2"}]} }
> 
> (Nothing is restricted.) And then the outer query is the same as my first query
> above:
> 
>      SELECT DocId
>        FROM ( result )
>       WHERE Name.Url = "whatever"
> 
> { {"DocId":1, "Name":[{"Url","whatever"}]},
>   {"DocId":2, "Name":null} }
> 
> I want to emphasize again that I really want to be wrong about my reading and
> interpretation of the Dremel paper.
> 
> 
> --
> Rob Grzywinski
> 
> 
> 
> On 8/26/12 9:46 PM, "Constantine Peresypkin" <pconstantine@gmail.com> wrote:
> 
>> Hello Rob,
>> I think you meant
>> 
>>     SELECT DocId
>>       FROM (
>>                SELECT DocId
>>                  FROM data
>>                 WHERE Name.Url IS NOT NULL )
>>      WHERE Name.Url = "whatever"
>> 
>> in the example query.
>> 
>> In general you're right, you can only "optimize" queries that have nothing
>> but "required" fields inside.
>> But the whole point of the Dremel columnar structure is: do a fast
>> full-scan and then a fast prune.
>> That's why they have all those integer repetition/definition levels: they
>> read only the levels, if possible, and make an early "pruning" decision.
>> But you can optimize the queries by other means, mainly by data
>> "compression"; in effect you just optimize full column scans.
>> For example: do not store NULLs, use RLE encoding, dictionary encoding, and
>> "plain" compression.
>> We even implemented some of these optimizations in OpenDremel.
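>> 
>> A rough sketch of what I mean (toy Python with made-up data, not the actual
>> OpenDremel code): store only the non-NULL values plus a run-length-encoded
>> stream of definition levels, and scan just the levels to skip the NULL slots.
>> 
>>     # Optional column: definition level 1 = value present, 0 = NULL.
>>     values = ["A1", "whatever", "C1", "A2"]   # NULLs are not stored at all
>>     def_levels = [1, 1, 1, 0, 1, 0, 0]        # one level per slot
>> 
>>     def rle_encode(levels):
>>         # Run-length encode the levels: [[level, run_length], ...]
>>         runs = []
>>         for lvl in levels:
>>             if runs and runs[-1][0] == lvl:
>>                 runs[-1][1] += 1
>>             else:
>>                 runs.append([lvl, 1])
>>         return runs
>> 
>>     def scan(levels, values):
>>         # Early pruning: touch a value only when its level says it exists.
>>         it = iter(values)
>>         return [next(it) if lvl == 1 else None for lvl in levels]
>> 
>>     print(rle_encode(def_levels))      # [[1, 3], [0, 1], [1, 1], [0, 2]]
>>     print(scan(def_levels, values))    # ['A1', 'whatever', 'C1', None, 'A2', None, None]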
>> 
>> On Mon, Aug 27, 2012 at 1:50 AM, Rob Grzywinski
>> <rob@aggregateknowledge.com> wrote:
>> 
>>> I wanted to share some of the internal work that we had done on Dremel +
>>> BigQuery-like queries to inject some momentum into the group.
>>> 
>>> We finished a basic implementation of a field-stripe encoder and decoder
>>> and wanted to put a little effort into querying some data. (See
>>> https://github.com/rgrzywinski/field-stripe/ )
>>> 
>>> We had the following requirements:
>>> * 1 pass over the data
>>> * SAX-style output (result set) record writer (implying that you don't
>>>   cache results -- you simply hand them off as you project / select)
>>> * support for select (filter), project and aggregate
>>> 
>>> We quickly came to the realization that there is only one model that is
>>> supported under these requirements: sub-tree pruning. (You have to spend
>>> some time thinking about a schema tree and what happens if you have select
>>> and project in different branches of that tree. You realize that there are
>>> always cases in which you cannot evaluate your select before you have to
>>> project. This boils down to the fact that you will hand off a result and
>>> then later realize that it should have been filtered. The only way that
>>> you can prevent this is by saying that selects only affect themselves and
>>> their children (i.e. "sub-tree pruning").) The problem with sub-tree
>>> pruning is that it's 100% counter to everything that you've ever done with
>>> SQL. For example:
>>> 
>>>     SELECT DocId
>>>       FROM data
>>>      WHERE Name.Url = "whatever"
>>> 
>>> (using the schema from the Dremel paper)
>>> 
>>> That query would return *all records* but prune those sub-records that
>>> don't have Name.Url equal to "whatever". What would be more natural (i.e.
>>> more SQL-like) would be to filter the records that don't have Name.Url.
>>> But then one immediately asks "what about the sub-records"? Do you both
>>> filter the records and prune the sub-trees? And what does "=" mean if
>>> Name.Url is a collection? Does it mean "has only", "contains", other?
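>>> 
>>> Just to spell out the ambiguity, here is a toy sketch (plain Python over
>>> dicts, nothing to do with our encoder) of three things "=" could plausibly
>>> mean for a repeated field:
>>> 
>>>     record = {"DocId": 1,
>>>               "Name": [{"Url": "A1"}, {"Url": "whatever"}, {"Url": "C1"}]}
>>>     urls = [n.get("Url") for n in record["Name"]]
>>> 
>>>     contains = any(u == "whatever" for u in urls)   # keep the record if any Name.Url matches
>>>     has_only = all(u == "whatever" for u in urls)   # keep it only if every Name.Url matches
>>>     pruned = [n for n in record["Name"]             # sub-tree pruning: keep the record,
>>>               if n.get("Url") == "whatever"]        # drop the non-matching Name entries
>>> 
>>>     print(contains, has_only, pruned)   # True False [{'Url': 'whatever'}]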
>>> 
>>> So how does Dremel solve all of this? They just do the sub-record/tree
>>> pruning and then wrap the query within another query to get what they
>>> want. (And FYI -- Dremel / BigQuery only support two-table joins (where
>>> either or both tables can be a sub-select) --
>>> https://developers.google.com/bigquery/docs/query-reference)
>>> 
>>>     SELECT DocId
>>>       FROM (
>>>                SELECT DocId
>>>                  FROM data
>>>                 WHERE Name.Url = "whatever" )
>>>      WHERE Name.Url IS NOT NULL
>>> 
>>> (Or something like that.) You can always make a sequence of queries that
>>> gives you what you want.
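>>> 
>>> Coming back to the one-pass / SAX-style requirement above, here is a rough
>>> sketch (plain Python with a made-up writer callback, not our field-stripe
>>> code) of why a select can only prune its own subtree: DocId is handed off
>>> before any Name.Url value has been read, so when the predicate later fails
>>> there is nothing left to retract.
>>> 
>>>     def evaluate(records, url_pred, write):
>>>         # One pass in schema order: DocId first, then the Name.Url values.
>>>         for rec in records:
>>>             write("DocId", rec["DocId"])       # emitted before Name.Url is seen;
>>>                                                # it cannot be taken back later
>>>             for name in rec.get("Name", []):
>>>                 if url_pred(name.get("Url")):  # the select prunes only this subtree
>>>                     write("Name.Url", name["Url"])
>>> 
>>>     data = [
>>>         {"DocId": 1, "Name": [{"Url": "A1"}, {"Url": "whatever"}, {"Url": "C1"}]},
>>>         {"DocId": 2, "Name": [{"Url": "A2"}]},
>>>     ]
>>>     evaluate(data, lambda url: url == "whatever",
>>>              lambda field, value: print(field, value))
>>>     # DocId 1 / Name.Url whatever / DocId 2  -- DocId 2 slips through even
>>>     # though nothing under it matched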
>>> 
>>> (I would love to see someone from the group validate or squash all of this
>>> theory and conjecture! I really want to be wrong here!)
>>> 
>>> One really, really wants to try this out in BigQuery. Unfortunately, in
>>> spite of promises, you still can't add hierarchical data to BigQuery. You
>>> can only use their pre-defined data sets, which unfortunately don't have
>>> any interesting data in them to allow you to tease out answers. The moment
>>> you use a flat structure, all of this hoopla goes away and you devolve to
>>> SQL (which seems to be all that the public version of BigQuery supports at
>>> the moment).
>>> 
>>> So we started thinking: what else supports structured queries? XPath /
>>> XQuery. XPath really only gives you subtrees, which isn't that interesting.
>>> XQuery immediately forces you into FLWOR. Having step 1 of writing a query
>>> processor be FLWOR is depressing.
>>> 
>>> So we kept looking. What about JSON? There's a ton of dead projects for
>>> JSON queries :(  We looked at MongoDB's query interface. We looked at
>>> CouchBase. Ah! UnQL ( http://www.unqlspec.org/display/UnQL/Home ) and a ton
>>> of press releases. But, oh wait, it seems to be dead (
>>> http://www.arangodb.org/2012/04/07/is_unql_dead ). Oh boy! Then we found
>>> this AQL thing ( http://www.arangodb.org/manuals/Aql.html and more info at
>>> http://www.arangodb.org/2012/04/19/rfc-the-avocadodb-query-language ).
>>> Oddly, they had to resort to FLWOR-like structures too.
>>> 
>>> (More to come ...)
>>> 
>>> 
>>> --
>>> Rob Grzywinski
>>> 
>>> 
>> 

