drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Constantine Peresypkin <pconstant...@gmail.com>
Subject Re: Stymied by queries
Date Mon, 27 Aug 2012 16:12:43 GMT
I think I understand what you want.
In BQL you write it as follows:

SELECT DocId, LAST(Name.Url) WITHIN Name AS Url
    FROM data
    WHERE Name.Url = 'whatever'

And then you can prune the NULL ones.

And if you think that this is a "problem", you haven't seen anything yet. :)

Constantine

On Mon, Aug 27, 2012 at 5:56 PM, Rob Grzywinski
<rob@aggregateknowledge.com>wrote:

> (And now that I really look at this, perhaps my outer query should be:
>
>    SELECT DocId
>      FROM data
>     WHERE Name IS NOT NULL
>
> (rather than "Name.Url IS NOT NULL") so that the level of the filter and
> project are the same.)
>
> (And yes, I was being loose with the "SELECT" part for illustrative
> purposes
> below. It should have been "SELECT DocId, Name.Url" to return the results
> that I show.)
>
> On 8/27/12 8:57 AM, "Rob Grzywinski" <rob@aggregateknowledge.com> wrote:
>
> > You're highlighting a fundamental concern that I have about the queries
> as
> > stated in the Dremel paper. Section 5 in the Dremel paper says:
> >
> > "Think of a nested record as a labeled tree, where each label
> corresponds to a
> > field name. The selection op-erator prunes away the branches of the tree
> that
> > do not satisfy the specified conditions. Thus, only those nested records
> are
> > retained where Name.Url is defined and starts with http."
> >
> > Notice that it says "only those *nested* records are retained" and not
> "only
> > those records are retained".  (This also follows if you work through the
> > algorithm in Figure 19 from the Dremel paper (what we've called the
> "trickle
> > reassembly").)
> >
> > Specifically,
> >
> >      SELECT DocId
> >        FROM data
> >       WHERE Name.Url = ³whatever²
> >
> > will return *all* records but prune those nested records where Name.Url
> > doesn't equal "whatever". For example if you had the data (in JSON'y
> format):
> >
> > { {"DocId":1, "Name":[{"Url":"A1"},{"Url","whatever"},{"Url":"C1"}]},
> >   {"DocId":2, "Name":[{"Url":"A2"}]} }
> >
> > then the above query would return:
> >
> > { {"DocId":1, "Name":[{"Url","whatever"}]},
> >   {"DocId":2, "Name":null} }
> >
> > To then get only those records that have Name.Url equal to "whatever" you
> > would need to wrap this with:
> >
> >      SELECT DocId
> >        FROM ( results )
> >       WHERE Name.Url IS NOT NULL
> >
> > which gives:
> >
> > { {"DocId":1, "Name":[{"Url","whatever"}]} }
> >
> > The inner query that you presented would result in:
> >
> >   SELECT DocId
> >     FROM data
> >    WHERE Name.Url IS NOT NULL
> >
> > { {"DocId":1, "Name":[{"Url":"A1"},{"Url","whatever"},{"Url":"C1"}]},
> >   {"DocId":2, "Name":[{"Url":"A2"}]} }
> >
> > (Nothing is restricted) And then the outer query is the same as my first
> query
> > above:
> >
> >      SELECT DocId
> >        FROM ( result )
> >       WHERE Name.Url = ³whatever²
> >
> > { {"DocId":1, "Name":[{"Url","whatever"}]},
> >   {"DocId":2, "Name":null} }
> >
> > I want to emphasize again that I really want to be wrong about my
> reading and
> > interpretation of the Dremel paper.
> >
> >
> > --
> > Rob Grzywinski
> >
> >
> >
> > On 8/26/12 9:46 PM, "Constantine Peresypkin" <pconstantine@gmail.com>
> wrote:
> >
> >> Hello Rob,
> >> I think you've meant
> >>
> >> SELECT DocId
> >>       FROM (
> >>                SELECT DocId
> >>                  FROM data
> >>                 WHERE Name.Url IS NOT NULL)
> >>       WHERE Name.Url = ³whatever²
> >>
> >> in the example query.
> >>
> >> In general you're right, you can only "optimize" queries that have
> nothing
> >> but "required" fields inside.
> >> But the whole point in Dremel columnar structure is: do a fast full-scan
> >> and then fast prune.
> >> That's why they have all those integer repetition/definition levels,
> they
> >> only read the levels, if possible, and make early "pruning" decision.
> >> But you can optimize the queries by other means, mainly by data
> >> "compression", in fact you just optimize full column scans.
> >> For example: do not store NULLs, use RLE encoding, dictionary encoding
> and
> >> "plain" compression.
> >> We even implemented some of these optimizations in OpenDremel.
> >>
> >> On Mon, Aug 27, 2012 at 1:50 AM, Rob Grzywinski
> >> <rob@aggregateknowledge.com>wrote:
> >>
> >>> I wanted to share some of the internal work that we had done on Dremel
> +
> >>> BigQuery-like queries to inject some momentum into the group.
> >>>
> >>> We finished a basic implementation of a field-stripe encoder and
> decoder
> >>> and
> >>> wanted to put a little effort into querying some data. (See
> >>> https://github.com/rgrzywinski/field-stripe/ )
> >>>
> >>> We had the following requirements:
> >>> * 1 pass over the data
> >>> * SAX-style output (result set) record writer (implying that you don¹t
> >>> cache
> >>> results < you simply hand them off as you project / select)
> >>> * support for select (filter), project and aggregate
> >>>
> >>> We quickly came to the realization that there is only one model that is
> >>> supported under these requirements: sub-tree pruning. (You have to
> spend
> >>> some time thinking about a schema tree and what happens if you have
> select
> >>> and project in different branches of that tree. You realize that there
> are
> >>> always cases in which you cannot evaluate your select before you have
> to
> >>> project. This boils down to the fact that will hand off a result and
> then
> >>> later realize that it should have been filtered. The only way that you
> can
> >>> prevent this is by saying that selects only affect themselves and their
> >>> children (i.e. ³sub-tree pruning²).) The problem with sub-tree pruning
> is
> >>> that it¹s 100% counter to everything that you¹ve ever done with SQL.
> For
> >>> example:
> >>>
> >>>     SELECT DocId
> >>>       FROM data
> >>>      WHERE Name.Url = ³whatever²
> >>>
> >>> (using the schema from the Dremel paper)
> >>>
> >>> That query would return *all records* but prune those sub-records that
> >>> don¹t
> >>> have Name.Url equal to ³whatever². What would be more natural (i.e.
> more
> >>> SQL-like) would be to filter the records that don¹t have Name.Url. But
> then
> >>> one immediately asks ³what about the sub-records²? Do you both filter
> the
> >>> records and prune the sub-trees? And what does ³=² mean if Name.Url is
> a
> >>> collection? Does it mean ³has only², ³contains², other?
> >>>
> >>> So how does Dremel solve all of this? They just do the sub-record/tree
> >>> purning and then wrap the query within another query to get what they
> want.
> >>> (And FYI < Dremel / BigQuery only support two table joins (where
> either or
> >>> both table can be a sub-select) --
> >>> https://developers.google.com/bigquery/docs/query-reference)
> >>>
> >>>     SELECT DocId
> >>>       FROM (
> >>>                SELECT DocId
> >>>                  FROM data
> >>>                WHERE Name.Url = ³whatever² )
> >>>      WHERE Name.Url IS NOT NULL
> >>>
> >>> (Or something like that) You can always make a sequence of queries that
> >>> gives you what you want.
> >>>
> >>> (I would love to see someone from the group validate or squash all of
> this
> >>> theory and conjecture! I really want to be wrong here!)
> >>>
> >>> One really really wants to try this out in BigQuery. Unfortunately, in
> >>> spite
> >>> of promises, you still can¹t add hierarchical data to BigQuery. You can
> >>> only
> >>> use their pre-defined data sets which unfortunately doesn¹t have any
> >>> interesting data in it to allow you to tease out answers. The moment
> that
> >>> you use a flat structure then all of this hoopla goes away and you
> devolve
> >>> to SQL (which is what it seems that the public version of BigQuery only
> >>> supports at this moment).
> >>>
> >>> So we started thinking: what else does structured queries? XPath /
> XQuery.
> >>> XPath really only gives you subtrees which isn¹t that interesting.
> XQuery
> >>> immediately forces you into FLWOR. Having step 1 of writing a query
> >>> processor be FLWOR is depressing.
> >>>
> >>> So we kept looking. What about JSON? There¹s a ton of dead projects for
> >>> JSON
> >>> queries :(  We looked at MongoDB¹s query interface.  We looked at
> >>> CouchBase.
> >>> Ah! UnQL ( http://www.unqlspec.org/display/UnQL/Home ) and a ton of
> press
> >>> releases. But, oh wait, it seems to be dead (
> >>> http://www.arangodb.org/2012/04/07/is_unql_dead ). Oh boy!. Then we
> find
> >>> this AQL thing ( http://www.arangodb.org/manuals/Aql.html and more
> info at
> >>> http://www.arangodb.org/2012/04/19/rfc-the-avocadodb-query-language ).
> >>> Oddly, they had to resort to FLWOR-like structures too.
> >>>
> >>> (More to come ...)
> >>>
> >>>
> >>> --
> >>> Rob Grzywinski
> >>>
> >>>
> >>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message