(And now that I really look at this, perhaps my outer query should be:
SELECT DocId
FROM data
WHERE Name IS NOT NULL
(rather than "Name.Url IS NOT NULL") so that the level of the filter and
project are the same.)
(And yes, I was being loose with the "SELECT" part for illustrative purposes
below. It should have been "SELECT DocId, Name.Url" to return the results
that I show.)
On 8/27/12 8:57 AM, "Rob Grzywinski" <rob@aggregateknowledge.com> wrote:
> You're highlighting a fundamental concern that I have about the queries as
> stated in the Dremel paper. Section 5 in the Dremel paper says:
>
> "Think of a nested record as a labeled tree, where each label corresponds to a
> field name. The selection op-erator prunes away the branches of the tree that
> do not satisfy the specified conditions. Thus, only those nested records are
> retained where Name.Url is defined and starts with http."
>
> Notice that it says "only those *nested* records are retained" and not "only
> those records are retained". (This also follows if you work through the
> algorithm in Figure 19 from the Dremel paper (what we've called the "trickle
> reassembly").)
>
> Specifically,
>
> SELECT DocId
> FROM data
> WHERE Name.Url = ³whatever²
>
> will return *all* records but prune those nested records where Name.Url
> doesn't equal "whatever". For example if you had the data (in JSON'y format):
>
> { {"DocId":1, "Name":[{"Url":"A1"},{"Url","whatever"},{"Url":"C1"}]},
> {"DocId":2, "Name":[{"Url":"A2"}]} }
>
> then the above query would return:
>
> { {"DocId":1, "Name":[{"Url","whatever"}]},
> {"DocId":2, "Name":null} }
>
> To then get only those records that have Name.Url equal to "whatever" you
> would need to wrap this with:
>
> SELECT DocId
> FROM ( results )
> WHERE Name.Url IS NOT NULL
>
> which gives:
>
> { {"DocId":1, "Name":[{"Url","whatever"}]} }
>
> The inner query that you presented would result in:
>
> SELECT DocId
> FROM data
> WHERE Name.Url IS NOT NULL
>
> { {"DocId":1, "Name":[{"Url":"A1"},{"Url","whatever"},{"Url":"C1"}]},
> {"DocId":2, "Name":[{"Url":"A2"}]} }
>
> (Nothing is restricted) And then the outer query is the same as my first query
> above:
>
> SELECT DocId
> FROM ( result )
> WHERE Name.Url = ³whatever²
>
> { {"DocId":1, "Name":[{"Url","whatever"}]},
> {"DocId":2, "Name":null} }
>
> I want to emphasize again that I really want to be wrong about my reading and
> interpretation of the Dremel paper.
>
>
> --
> Rob Grzywinski
>
>
>
> On 8/26/12 9:46 PM, "Constantine Peresypkin" <pconstantine@gmail.com> wrote:
>
>> Hello Rob,
>> I think you've meant
>>
>> SELECT DocId
>> FROM (
>> SELECT DocId
>> FROM data
>> WHERE Name.Url IS NOT NULL)
>> WHERE Name.Url = ³whatever²
>>
>> in the example query.
>>
>> In general you're right, you can only "optimize" queries that have nothing
>> but "required" fields inside.
>> But the whole point in Dremel columnar structure is: do a fast full-scan
>> and then fast prune.
>> That's why they have all those integer repetition/definition levels, they
>> only read the levels, if possible, and make early "pruning" decision.
>> But you can optimize the queries by other means, mainly by data
>> "compression", in fact you just optimize full column scans.
>> For example: do not store NULLs, use RLE encoding, dictionary encoding and
>> "plain" compression.
>> We even implemented some of these optimizations in OpenDremel.
>>
>> On Mon, Aug 27, 2012 at 1:50 AM, Rob Grzywinski
>> <rob@aggregateknowledge.com>wrote:
>>
>>> I wanted to share some of the internal work that we had done on Dremel +
>>> BigQuery-like queries to inject some momentum into the group.
>>>
>>> We finished a basic implementation of a field-stripe encoder and decoder
>>> and
>>> wanted to put a little effort into querying some data. (See
>>> https://github.com/rgrzywinski/field-stripe/ )
>>>
>>> We had the following requirements:
>>> * 1 pass over the data
>>> * SAX-style output (result set) record writer (implying that you don¹t
>>> cache
>>> results < you simply hand them off as you project / select)
>>> * support for select (filter), project and aggregate
>>>
>>> We quickly came to the realization that there is only one model that is
>>> supported under these requirements: sub-tree pruning. (You have to spend
>>> some time thinking about a schema tree and what happens if you have select
>>> and project in different branches of that tree. You realize that there are
>>> always cases in which you cannot evaluate your select before you have to
>>> project. This boils down to the fact that will hand off a result and then
>>> later realize that it should have been filtered. The only way that you can
>>> prevent this is by saying that selects only affect themselves and their
>>> children (i.e. ³sub-tree pruning²).) The problem with sub-tree pruning is
>>> that it¹s 100% counter to everything that you¹ve ever done with SQL. For
>>> example:
>>>
>>> SELECT DocId
>>> FROM data
>>> WHERE Name.Url = ³whatever²
>>>
>>> (using the schema from the Dremel paper)
>>>
>>> That query would return *all records* but prune those sub-records that
>>> don¹t
>>> have Name.Url equal to ³whatever². What would be more natural (i.e. more
>>> SQL-like) would be to filter the records that don¹t have Name.Url. But then
>>> one immediately asks ³what about the sub-records²? Do you both filter the
>>> records and prune the sub-trees? And what does ³=² mean if Name.Url is a
>>> collection? Does it mean ³has only², ³contains², other?
>>>
>>> So how does Dremel solve all of this? They just do the sub-record/tree
>>> purning and then wrap the query within another query to get what they want.
>>> (And FYI < Dremel / BigQuery only support two table joins (where either or
>>> both table can be a sub-select) --
>>> https://developers.google.com/bigquery/docs/query-reference)
>>>
>>> SELECT DocId
>>> FROM (
>>> SELECT DocId
>>> FROM data
>>> WHERE Name.Url = ³whatever² )
>>> WHERE Name.Url IS NOT NULL
>>>
>>> (Or something like that) You can always make a sequence of queries that
>>> gives you what you want.
>>>
>>> (I would love to see someone from the group validate or squash all of this
>>> theory and conjecture! I really want to be wrong here!)
>>>
>>> One really really wants to try this out in BigQuery. Unfortunately, in
>>> spite
>>> of promises, you still can¹t add hierarchical data to BigQuery. You can
>>> only
>>> use their pre-defined data sets which unfortunately doesn¹t have any
>>> interesting data in it to allow you to tease out answers. The moment that
>>> you use a flat structure then all of this hoopla goes away and you devolve
>>> to SQL (which is what it seems that the public version of BigQuery only
>>> supports at this moment).
>>>
>>> So we started thinking: what else does structured queries? XPath / XQuery.
>>> XPath really only gives you subtrees which isn¹t that interesting. XQuery
>>> immediately forces you into FLWOR. Having step 1 of writing a query
>>> processor be FLWOR is depressing.
>>>
>>> So we kept looking. What about JSON? There¹s a ton of dead projects for
>>> JSON
>>> queries :( We looked at MongoDB¹s query interface. We looked at
>>> CouchBase.
>>> Ah! UnQL ( http://www.unqlspec.org/display/UnQL/Home ) and a ton of press
>>> releases. But, oh wait, it seems to be dead (
>>> http://www.arangodb.org/2012/04/07/is_unql_dead ). Oh boy!. Then we find
>>> this AQL thing ( http://www.arangodb.org/manuals/Aql.html and more info at
>>> http://www.arangodb.org/2012/04/19/rfc-the-avocadodb-query-language ).
>>> Oddly, they had to resort to FLWOR-like structures too.
>>>
>>> (More to come ...)
>>>
>>>
>>> --
>>> Rob Grzywinski
>>>
>>>
>>
|