drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Constantine Peresypkin <pconstant...@gmail.com>
Subject Re: Stymied by queries
Date Mon, 27 Aug 2012 02:46:51 GMT
Hello Rob,
I think you've meant

SELECT DocId
      FROM (
               SELECT DocId
                 FROM data
                WHERE Name.Url IS NOT NULL)
      WHERE Name.Url = ³whatever²

in the example query.

In general you're right, you can only "optimize" queries that have nothing
but "required" fields inside.
But the whole point in Dremel columnar structure is: do a fast full-scan
and then fast prune.
That's why they have all those integer repetition/definition levels, they
only read the levels, if possible, and make early "pruning" decision.
But you can optimize the queries by other means, mainly by data
"compression", in fact you just optimize full column scans.
For example: do not store NULLs, use RLE encoding, dictionary encoding and
"plain" compression.
We even implemented some of these optimizations in OpenDremel.

On Mon, Aug 27, 2012 at 1:50 AM, Rob Grzywinski
<rob@aggregateknowledge.com>wrote:

> I wanted to share some of the internal work that we had done on Dremel +
> BigQuery-like queries to inject some momentum into the group.
>
> We finished a basic implementation of a field-stripe encoder and decoder
> and
> wanted to put a little effort into querying some data. (See
> https://github.com/rgrzywinski/field-stripe/ )
>
> We had the following requirements:
> * 1 pass over the data
> * SAX-style output (result set) record writer (implying that you don¹t
> cache
> results < you simply hand them off as you project / select)
> * support for select (filter), project and aggregate
>
> We quickly came to the realization that there is only one model that is
> supported under these requirements: sub-tree pruning. (You have to spend
> some time thinking about a schema tree and what happens if you have select
> and project in different branches of that tree. You realize that there are
> always cases in which you cannot evaluate your select before you have to
> project. This boils down to the fact that will hand off a result and then
> later realize that it should have been filtered. The only way that you can
> prevent this is by saying that selects only affect themselves and their
> children (i.e. ³sub-tree pruning²).) The problem with sub-tree pruning is
> that it¹s 100% counter to everything that you¹ve ever done with SQL. For
> example:
>
>     SELECT DocId
>       FROM data
>      WHERE Name.Url = ³whatever²
>
> (using the schema from the Dremel paper)
>
> That query would return *all records* but prune those sub-records that
> don¹t
> have Name.Url equal to ³whatever². What would be more natural (i.e. more
> SQL-like) would be to filter the records that don¹t have Name.Url. But then
> one immediately asks ³what about the sub-records²? Do you both filter the
> records and prune the sub-trees? And what does ³=² mean if Name.Url is a
> collection? Does it mean ³has only², ³contains², other?
>
> So how does Dremel solve all of this? They just do the sub-record/tree
> purning and then wrap the query within another query to get what they want.
> (And FYI < Dremel / BigQuery only support two table joins (where either or
> both table can be a sub-select) --
> https://developers.google.com/bigquery/docs/query-reference)
>
>     SELECT DocId
>       FROM (
>                SELECT DocId
>                  FROM data
>                WHERE Name.Url = ³whatever² )
>      WHERE Name.Url IS NOT NULL
>
> (Or something like that) You can always make a sequence of queries that
> gives you what you want.
>
> (I would love to see someone from the group validate or squash all of this
> theory and conjecture! I really want to be wrong here!)
>
> One really really wants to try this out in BigQuery. Unfortunately, in
> spite
> of promises, you still can¹t add hierarchical data to BigQuery. You can
> only
> use their pre-defined data sets which unfortunately doesn¹t have any
> interesting data in it to allow you to tease out answers. The moment that
> you use a flat structure then all of this hoopla goes away and you devolve
> to SQL (which is what it seems that the public version of BigQuery only
> supports at this moment).
>
> So we started thinking: what else does structured queries? XPath / XQuery.
> XPath really only gives you subtrees which isn¹t that interesting. XQuery
> immediately forces you into FLWOR. Having step 1 of writing a query
> processor be FLWOR is depressing.
>
> So we kept looking. What about JSON? There¹s a ton of dead projects for
> JSON
> queries :(  We looked at MongoDB¹s query interface.  We looked at
> CouchBase.
> Ah! UnQL ( http://www.unqlspec.org/display/UnQL/Home ) and a ton of press
> releases. But, oh wait, it seems to be dead (
> http://www.arangodb.org/2012/04/07/is_unql_dead ). Oh boy!. Then we find
> this AQL thing ( http://www.arangodb.org/manuals/Aql.html and more info at
> http://www.arangodb.org/2012/04/19/rfc-the-avocadodb-query-language ).
> Oddly, they had to resort to FLWOR-like structures too.
>
> (More to come ...)
>
>
> --
> Rob Grzywinski
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message