drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Camuel Gilyadov <cam...@gmail.com>
Subject Re: Stymied by queries
Date Mon, 27 Aug 2012 10:48:56 GMT
The example is also not clear to me.

But I remember when Dremel query language applied to nested data it has a
"flattening" anomaly which I think is a consequence of its iteration model.
Here is a description of "flattening anomaly":

1. Query only prunes record's hierarchy. No new levels can be created.
2. However existing levels can be expanded with new fields within query.
3. Also deeper fields could be brought to the surface as within query.
4. More shallow fields cannot be put deeper within query.
5. So abstractly speaking: complexier you query more flattened resulting
record becomes.
6. More concrete: you need to compare fields on different levels of
hierarchy within record. You forced to create two nested queries, the inner
one irreversibly flattens the original nested record bringing all fields to
the surface level and then outer query now examines all fields at once
making required comparisons. But you cannot now pack it back... and as
consequence of the patch you now got flattened record and it breaks all the
application code above if you not repack it with adapter piece of code.
7.  Another anomaly: while iterating you are able to compare nested fields
with same last indices even if they are not related. But you cannot compare
fields with different last indices without first irreversibly "flattening"
the record.





On Mon, Aug 27, 2012 at 5:46 AM, Constantine Peresypkin <
pconstantine@gmail.com> wrote:

> Hello Rob,
> I think you've meant
>
> SELECT DocId
>       FROM (
>                SELECT DocId
>                  FROM data
>                 WHERE Name.Url IS NOT NULL)
>       WHERE Name.Url = ³whatever²
>
> in the example query.
>
> In general you're right, you can only "optimize" queries that have nothing
> but "required" fields inside.
> But the whole point in Dremel columnar structure is: do a fast full-scan
> and then fast prune.
> That's why they have all those integer repetition/definition levels, they
> only read the levels, if possible, and make early "pruning" decision.
> But you can optimize the queries by other means, mainly by data
> "compression", in fact you just optimize full column scans.
> For example: do not store NULLs, use RLE encoding, dictionary encoding and
> "plain" compression.
> We even implemented some of these optimizations in OpenDremel.
>
> On Mon, Aug 27, 2012 at 1:50 AM, Rob Grzywinski
> <rob@aggregateknowledge.com>wrote:
>
> > I wanted to share some of the internal work that we had done on Dremel +
> > BigQuery-like queries to inject some momentum into the group.
> >
> > We finished a basic implementation of a field-stripe encoder and decoder
> > and
> > wanted to put a little effort into querying some data. (See
> > https://github.com/rgrzywinski/field-stripe/ )
> >
> > We had the following requirements:
> > * 1 pass over the data
> > * SAX-style output (result set) record writer (implying that you don¹t
> > cache
> > results < you simply hand them off as you project / select)
> > * support for select (filter), project and aggregate
> >
> > We quickly came to the realization that there is only one model that is
> > supported under these requirements: sub-tree pruning. (You have to spend
> > some time thinking about a schema tree and what happens if you have
> select
> > and project in different branches of that tree. You realize that there
> are
> > always cases in which you cannot evaluate your select before you have to
> > project. This boils down to the fact that will hand off a result and then
> > later realize that it should have been filtered. The only way that you
> can
> > prevent this is by saying that selects only affect themselves and their
> > children (i.e. ³sub-tree pruning²).) The problem with sub-tree pruning is
> > that it¹s 100% counter to everything that you¹ve ever done with SQL. For
> > example:
> >
> >     SELECT DocId
> >       FROM data
> >      WHERE Name.Url = ³whatever²
> >
> > (using the schema from the Dremel paper)
> >
> > That query would return *all records* but prune those sub-records that
> > don¹t
> > have Name.Url equal to ³whatever². What would be more natural (i.e. more
> > SQL-like) would be to filter the records that don¹t have Name.Url. But
> then
> > one immediately asks ³what about the sub-records²? Do you both filter the
> > records and prune the sub-trees? And what does ³=² mean if Name.Url is a
> > collection? Does it mean ³has only², ³contains², other?
> >
> > So how does Dremel solve all of this? They just do the sub-record/tree
> > purning and then wrap the query within another query to get what they
> want.
> > (And FYI < Dremel / BigQuery only support two table joins (where either
> or
> > both table can be a sub-select) --
> > https://developers.google.com/bigquery/docs/query-reference)
> >
> >     SELECT DocId
> >       FROM (
> >                SELECT DocId
> >                  FROM data
> >                WHERE Name.Url = ³whatever² )
> >      WHERE Name.Url IS NOT NULL
> >
> > (Or something like that) You can always make a sequence of queries that
> > gives you what you want.
> >
> > (I would love to see someone from the group validate or squash all of
> this
> > theory and conjecture! I really want to be wrong here!)
> >
> > One really really wants to try this out in BigQuery. Unfortunately, in
> > spite
> > of promises, you still can¹t add hierarchical data to BigQuery. You can
> > only
> > use their pre-defined data sets which unfortunately doesn¹t have any
> > interesting data in it to allow you to tease out answers. The moment that
> > you use a flat structure then all of this hoopla goes away and you
> devolve
> > to SQL (which is what it seems that the public version of BigQuery only
> > supports at this moment).
> >
> > So we started thinking: what else does structured queries? XPath /
> XQuery.
> > XPath really only gives you subtrees which isn¹t that interesting. XQuery
> > immediately forces you into FLWOR. Having step 1 of writing a query
> > processor be FLWOR is depressing.
> >
> > So we kept looking. What about JSON? There¹s a ton of dead projects for
> > JSON
> > queries :(  We looked at MongoDB¹s query interface.  We looked at
> > CouchBase.
> > Ah! UnQL ( http://www.unqlspec.org/display/UnQL/Home ) and a ton of
> press
> > releases. But, oh wait, it seems to be dead (
> > http://www.arangodb.org/2012/04/07/is_unql_dead ). Oh boy!. Then we find
> > this AQL thing ( http://www.arangodb.org/manuals/Aql.html and more info
> at
> > http://www.arangodb.org/2012/04/19/rfc-the-avocadodb-query-language ).
> > Oddly, they had to resort to FLWOR-like structures too.
> >
> > (More to come ...)
> >
> >
> > --
> > Rob Grzywinski
> >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message