drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rob Grzywinski <...@aggregateknowledge.com>
Subject Re: Stymied by queries
Date Tue, 28 Aug 2012 00:36:09 GMT
I think that you¹ve captured it well. My meta-point on all of this is that
if you just read the Dremel paper casually then you can walk away thinking
"Oooh! Look. I just run normal SQL and things go fast!". But the reality is
that when you start introducing the hierarchical / nested types then SQL
starts to get in your way more than it helps. (Which is why I started to
look at FLWOR and the like.)

--
Rob Grzywinski


On 8/27/12 11:12 AM, "Constantine Peresypkin" <pconstantine@gmail.com>
wrote:

> I think I understand what you want.
> In BQL you write it as follows:
> 
> SELECT DocId, LAST(Name.Url) WITHIN Name AS Url
>     FROM data
>     WHERE Name.Url = 'whatever'
> 
> And then you can prune the NULL ones.
> 
> And if you think that this is a "problem", you haven't seen anything yet. :)
> 
> Constantine
> 
> On Mon, Aug 27, 2012 at 5:56 PM, Rob Grzywinski
> <rob@aggregateknowledge.com>wrote:
> 
>> (And now that I really look at this, perhaps my outer query should be:
>> 
>>    SELECT DocId
>>      FROM data
>>     WHERE Name IS NOT NULL
>> 
>> (rather than "Name.Url IS NOT NULL") so that the level of the filter and
>> project are the same.)
>> 
>> (And yes, I was being loose with the "SELECT" part for illustrative
>> purposes
>> below. It should have been "SELECT DocId, Name.Url" to return the results
>> that I show.)
>> 
>> On 8/27/12 8:57 AM, "Rob Grzywinski" <rob@aggregateknowledge.com> wrote:
>> 
>>> You're highlighting a fundamental concern that I have about the queries
>> as
>>> stated in the Dremel paper. Section 5 in the Dremel paper says:
>>> 
>>> "Think of a nested record as a labeled tree, where each label
>> corresponds to a
>>> field name. The selection op-erator prunes away the branches of the tree
>> that
>>> do not satisfy the specified conditions. Thus, only those nested records
>> are
>>> retained where Name.Url is defined and starts with http."
>>> 
>>> Notice that it says "only those *nested* records are retained" and not
>> "only
>>> those records are retained".  (This also follows if you work through the
>>> algorithm in Figure 19 from the Dremel paper (what we've called the
>> "trickle
>>> reassembly").)
>>> 
>>> Specifically,
>>> 
>>>      SELECT DocId
>>>        FROM data
>>>       WHERE Name.Url = ³whatever²
>>> 
>>> will return *all* records but prune those nested records where Name.Url
>>> doesn't equal "whatever". For example if you had the data (in JSON'y
>> format):
>>> 
>>> { {"DocId":1, "Name":[{"Url":"A1"},{"Url","whatever"},{"Url":"C1"}]},
>>>   {"DocId":2, "Name":[{"Url":"A2"}]} }
>>> 
>>> then the above query would return:
>>> 
>>> { {"DocId":1, "Name":[{"Url","whatever"}]},
>>>   {"DocId":2, "Name":null} }
>>> 
>>> To then get only those records that have Name.Url equal to "whatever" you
>>> would need to wrap this with:
>>> 
>>>      SELECT DocId
>>>        FROM ( results )
>>>       WHERE Name.Url IS NOT NULL
>>> 
>>> which gives:
>>> 
>>> { {"DocId":1, "Name":[{"Url","whatever"}]} }
>>> 
>>> The inner query that you presented would result in:
>>> 
>>>   SELECT DocId
>>>     FROM data
>>>    WHERE Name.Url IS NOT NULL
>>> 
>>> { {"DocId":1, "Name":[{"Url":"A1"},{"Url","whatever"},{"Url":"C1"}]},
>>>   {"DocId":2, "Name":[{"Url":"A2"}]} }
>>> 
>>> (Nothing is restricted) And then the outer query is the same as my first
>> query
>>> above:
>>> 
>>>      SELECT DocId
>>>        FROM ( result )
>>>       WHERE Name.Url = ³whatever²
>>> 
>>> { {"DocId":1, "Name":[{"Url","whatever"}]},
>>>   {"DocId":2, "Name":null} }
>>> 
>>> I want to emphasize again that I really want to be wrong about my
>> reading and
>>> interpretation of the Dremel paper.
>>> 
>>> 
>>> --
>>> Rob Grzywinski
>>> 
>>> 
>>> 
>>> On 8/26/12 9:46 PM, "Constantine Peresypkin" <pconstantine@gmail.com>
>> wrote:
>>> 
>>>> Hello Rob,
>>>> I think you've meant
>>>> 
>>>> SELECT DocId
>>>>       FROM (
>>>>                SELECT DocId
>>>>                  FROM data
>>>>                 WHERE Name.Url IS NOT NULL)
>>>>       WHERE Name.Url = ³whatever²
>>>> 
>>>> in the example query.
>>>> 
>>>> In general you're right, you can only "optimize" queries that have
>> nothing
>>>> but "required" fields inside.
>>>> But the whole point in Dremel columnar structure is: do a fast full-scan
>>>> and then fast prune.
>>>> That's why they have all those integer repetition/definition levels,
>> they
>>>> only read the levels, if possible, and make early "pruning" decision.
>>>> But you can optimize the queries by other means, mainly by data
>>>> "compression", in fact you just optimize full column scans.
>>>> For example: do not store NULLs, use RLE encoding, dictionary encoding
>> and
>>>> "plain" compression.
>>>> We even implemented some of these optimizations in OpenDremel.
>>>> 
>>>> On Mon, Aug 27, 2012 at 1:50 AM, Rob Grzywinski
>>>> <rob@aggregateknowledge.com>wrote:
>>>> 
>>>>> I wanted to share some of the internal work that we had done on Dremel
>> +
>>>>> BigQuery-like queries to inject some momentum into the group.
>>>>> 
>>>>> We finished a basic implementation of a field-stripe encoder and
>> decoder
>>>>> and
>>>>> wanted to put a little effort into querying some data. (See
>>>>> https://github.com/rgrzywinski/field-stripe/ )
>>>>> 
>>>>> We had the following requirements:
>>>>> * 1 pass over the data
>>>>> * SAX-style output (result set) record writer (implying that you don¹t
>>>>> cache
>>>>> results < you simply hand them off as you project / select)
>>>>> * support for select (filter), project and aggregate
>>>>> 
>>>>> We quickly came to the realization that there is only one model that
is
>>>>> supported under these requirements: sub-tree pruning. (You have to
>> spend
>>>>> some time thinking about a schema tree and what happens if you have
>> select
>>>>> and project in different branches of that tree. You realize that there
>> are
>>>>> always cases in which you cannot evaluate your select before you have
>> to
>>>>> project. This boils down to the fact that will hand off a result and
>> then
>>>>> later realize that it should have been filtered. The only way that you
>> can
>>>>> prevent this is by saying that selects only affect themselves and their
>>>>> children (i.e. ³sub-tree pruning²).) The problem with sub-tree pruning
>> is
>>>>> that it¹s 100% counter to everything that you¹ve ever done with SQL.
>> For
>>>>> example:
>>>>> 
>>>>>     SELECT DocId
>>>>>       FROM data
>>>>>      WHERE Name.Url = ³whatever²
>>>>> 
>>>>> (using the schema from the Dremel paper)
>>>>> 
>>>>> That query would return *all records* but prune those sub-records that
>>>>> don¹t
>>>>> have Name.Url equal to ³whatever². What would be more natural (i.e.
>> more
>>>>> SQL-like) would be to filter the records that don¹t have Name.Url. But
>> then
>>>>> one immediately asks ³what about the sub-records²? Do you both filter
>> the
>>>>> records and prune the sub-trees? And what does ³=² mean if Name.Url
is
>> a
>>>>> collection? Does it mean ³has only², ³contains², other?
>>>>> 
>>>>> So how does Dremel solve all of this? They just do the sub-record/tree
>>>>> purning and then wrap the query within another query to get what they
>> want.
>>>>> (And FYI < Dremel / BigQuery only support two table joins (where
>> either or
>>>>> both table can be a sub-select) --
>>>>> https://developers.google.com/bigquery/docs/query-reference)
>>>>> 
>>>>>     SELECT DocId
>>>>>       FROM (
>>>>>                SELECT DocId
>>>>>                  FROM data
>>>>>                WHERE Name.Url = ³whatever² )
>>>>>      WHERE Name.Url IS NOT NULL
>>>>> 
>>>>> (Or something like that) You can always make a sequence of queries that
>>>>> gives you what you want.
>>>>> 
>>>>> (I would love to see someone from the group validate or squash all of
>> this
>>>>> theory and conjecture! I really want to be wrong here!)
>>>>> 
>>>>> One really really wants to try this out in BigQuery. Unfortunately, in
>>>>> spite
>>>>> of promises, you still can¹t add hierarchical data to BigQuery. You
can
>>>>> only
>>>>> use their pre-defined data sets which unfortunately doesn¹t have any
>>>>> interesting data in it to allow you to tease out answers. The moment
>> that
>>>>> you use a flat structure then all of this hoopla goes away and you
>> devolve
>>>>> to SQL (which is what it seems that the public version of BigQuery only
>>>>> supports at this moment).
>>>>> 
>>>>> So we started thinking: what else does structured queries? XPath /
>> XQuery.
>>>>> XPath really only gives you subtrees which isn¹t that interesting.
>> XQuery
>>>>> immediately forces you into FLWOR. Having step 1 of writing a query
>>>>> processor be FLWOR is depressing.
>>>>> 
>>>>> So we kept looking. What about JSON? There¹s a ton of dead projects
for
>>>>> JSON
>>>>> queries :(  We looked at MongoDB¹s query interface.  We looked at
>>>>> CouchBase.
>>>>> Ah! UnQL ( http://www.unqlspec.org/display/UnQL/Home ) and a ton of
>> press
>>>>> releases. But, oh wait, it seems to be dead (
>>>>> http://www.arangodb.org/2012/04/07/is_unql_dead ). Oh boy!. Then we
>> find
>>>>> this AQL thing ( http://www.arangodb.org/manuals/Aql.html and more
>> info at
>>>>> http://www.arangodb.org/2012/04/19/rfc-the-avocadodb-query-language ).
>>>>> Oddly, they had to resort to FLWOR-like structures too.
>>>>> 
>>>>> (More to come ...)
>>>>> 
>>>>> 
>>>>> --
>>>>> Rob Grzywinski
>>>>> 
>>>>> 
>>>> 
>> 
>> 
> 


Mime
View raw message