drill-dev mailing list archives

From Jacques Nadeau <jacques.dr...@gmail.com>
Subject Re: Introduction
Date Sun, 20 Jan 2013 22:18:45 GMT
On Sun, Jan 20, 2013 at 11:39 AM, Siprell, Stefan
<stefan.siprell@exxeta.de> wrote:

> Hi Jacques,
> maybe I am overconfident, but I think we have a great discussion going on
> here. Should we continue it using the developer mailing list, or do we
> already have a policy on this?
>

As Ted said, this is a great place for this discussion.  I really
appreciate you pushing the boundaries and am personally finding it quite
useful.



> Either way, I am enthusiastic to give my response. I purposely drifted a
> bit away from the regular SQL syntax to provoke some creative discussion. I
> thought it would be easiest to think about the query language at the very
> beginning of the project, before we settle on SQL. I will give my feedback
> on the suggested queries at the end of the mail.
>

Good idea.  Remember, we're not going to settle on one particular query
language.  Our goal is just to make the Logical Plan expressive enough to
support these concepts.  I really want super-SQL to be among the first
languages, but the more the better.



> Using a select * on hierarchical data should absolutely return a deep copy
> of hierarchical data - granted. But I think that using SQL simply to
> prune branches from the hierarchical data is not desirable, so you basically
> have to offer some kind of document translation sooner or later. Document
> translation languages tend to have a very verbose syntax and require lots
> of cumbersome coding - just look at XSLT. Hence my suggestion to make flat
> output a first-class citizen, as the result and the language constructs are
> familiar to users. One can always add XQuery or something similar to
> deepen the data again. I don't want to overemphasize the ease-of-use
> issue, but the prime reason I am interested in Drill is that it allows
> real-time processing of queries on large datasets. I am thinking of
> developers sitting in front of consoles, typing queries and firing them off,
> so I was really trying to define a compact and readable language.
> Remembering my XSLT times: I really was not all that efficient fiddling
> with this.
>
Agreed.  From my perspective, we should try to add support for a reasonable
subset of use cases without destroying usability.


> I am not skeptical at all that SQL will achieve the desired goals. I am
> just wondering if there is anything better :-)
>

There is room for both :)


> Feedback on the first query:
> Looking at the sub-select recordPerRevision, I wonder how the query
> builder would know that we want one entry per page. Is this inferred from
> the fact that the paths mediawiki.page.title and mediawiki.page.id both
> have mediawiki.page as the last non-scalar path entry? What would happen
> if we need two columns with different path depths? I would at least suggest
> modeling this more explicitly using:
>
>  select
>    title as pageTitle,
>    id as pageId,
>    flatten(revision) as rev
>  from mediawiki.page
>
>
My thought was that a straight dot path (without repeated-item indicators
'[]') would always be one value per record.  As such, the concept of flatten
is clear here, so explicit or implicit seems okay.  (For example,
mediawiki.page.revision.timestamp would actually return a null value.)
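
To make that concrete, here is a rough sketch (purely illustrative, using the
hypothetical mediawiki schema from this thread; none of this is settled
syntax):

  -- straight dot paths only: one row per mediawiki record; since revision
  -- repeats, a plain dot path into it (e.g. revision.timestamp) is just null
  select mediawiki.page.title, mediawiki.page.id
  from mediawiki

  -- explicit flatten: one row per revision, so the repetition is visible
  select
    mediawiki.page.title as pageTitle,
    mediawiki.page.id as pageId,
    flatten(mediawiki.page.revision) as rev
  from mediawiki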

> The multi-path entry in the from clause would clearly anchor the reference
> for further references, and both the user and the query builder would
> immediately recognize how many results to expect for what. Can the query
> compiler cope with this heavy-duty flatten operation in a select clause? We
> are basically running a join on flattened children of a node, and we
> describe this nonchalantly in the select clause and reference it in the
> order by statement. Possibly it is even used in an aggregation function. I
> think this would be great, but will it be simple to implement? Google seems
> to put this in the from clause and not in the select clause as well.
>

Time will tell on implementability.  It kind of reminds me of a distinct
clause in traditional SQL: the syntactic simplicity belies the execution
complexity.
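
For comparison, here is the same intent written both ways (again just a
sketch; the second form loosely mimics BigQuery's FLATTEN(table,
repeated_field) style and is not settled syntax):

  -- flatten declared in the select clause, as suggested above
  select
    title as pageTitle,
    id as pageId,
    flatten(revision) as rev
  from mediawiki.page

  -- flatten moved into the from clause, so the row multiplication happens
  -- where the source of rows is declared
  select
    title as pageTitle,
    id as pageId,
    revision.timestamp as revTime
  from flatten(mediawiki.page, revision)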


> I really like the within statement, but I am not sure if this works as
> expected. The where statement should only show revisions which occurred at
> the given time; then we want to show information on the pages which
> contained these revisions - something like a right join. Within seems to
> work as a left join, showing us pages (parents) which do not have any
> children (flatten) as well. A minor correction for the within clause: it is
> not based on RECORD but on mediawiki.page.revision. As I mentioned, it
> might be confusing which kind of join - right, left, inner, outer - is
> being done, and on what basis. At least for a dummy like me :-)
>

Good point. Reminds me of the difference between where and having.  Not
sure what the right solution is with regard to a contextually correct
filtering clause.  I'm kind of driven back towards a subquery solution.
The optimizer should be able to smash them back together on execution even
if they were built as a logical subquery.
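
As a sketch of what I mean (illustrative only; it reuses the query from my
earlier mail, and the point is where the filter sits, not the exact syntax):
filter the flattened revisions outside the subquery so that pages with no
matching revisions never resurface, and let the optimizer collapse the
nesting:

  select pageTitle, pageId, sum(rev.bytes) as totalChanges
  from (
    select
      mediawiki.page.title as pageTitle,
      mediawiki.page.id as pageId,
      flatten(mediawiki.page.revision) as rev
    from mediawiki
  ) as recordPerRevision
  where rev.timestamp between ? and ?   -- applied per flattened revision
  group by pageTitle, pageId
  order by totalChanges desc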


> I am again confused on the partitioned aggregation. Is the where
> statement executed logically before or after the aggregation? Same thing on
> the last suggested query. I can see very clearly what you are trying to
> achieve, but I am only human. Without having ever written a query planner,
> I am uncertain whether the machine can resolve the ambiguities in how the
> from, aggregation, select and join operations work together.
>

Yeah, same problem here.  This becomes even more problematic when we are
unsure of the schema until we actually start running the query (e.g.
running a first-time query against JSON).
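
To spell the ambiguity out (sketch only, reusing the within-style query from
my earlier mail): the single where below has at least two plausible readings,
and the planner would have to pick one:

  select
    mediawiki.page.title as pageTitle,
    mediawiki.page.id as pageId,
    sum(mediawiki.page.revision.bytes) as totalChanges within RECORD
  from mediawiki
  where mediawiki.page.revision.timestamp between ? and ?
  -- reading 1: filter revisions first, so totalChanges only sums the
  --            revisions that fall inside the time range
  -- reading 2: aggregate first, then drop whole records with no revision
  --            in the time range, keeping the full per-record sums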


>
> If we want to stick to a SQL dialect, then we might as well copy the
> BigQuery syntax from Google. If the Google API has some shortcomings, we
> should perhaps address them and explicitly name the issues. Perhaps I
> should map the XML to an RDBMS and write the same queries in the appropriate
> SQL. But I would really prefer to write the query in the Drill Logical
> Query Plan, to be more precise. Do we have some examples or a complete
> definition I can get my hands on?
>

I feel like BigQuery may have strayed from the spec a little with the
windowing/partitioning stuff.  Otherwise, BigQuery is not far off.  If you
want to write directly to Logical Plan syntax, that would be very helpful.
The syntax document is the best place to start [1].  I made an example query
previously at [2].  The logical reference interpreter is in alpha at [3].



[1] https://docs.google.com/a/maprtech.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit
[2] https://github.com/jacques-n/incubator-drill/wiki/SQL-Example
[3] https://github.com/jacques-n/incubator-drill.

>
> Stefan
>
>
>
>
>
>
>
> On 20.01.2013, at 19:30, Jacques Nadeau <jacques.drill@gmail.com> wrote:
>
> > I spent a little time looking at your first query.  I think, for all the
> > queries, writing a little more description of the query goals would be
> > helpful to ensure that I'm not misinterpreting your objective.
> >
> > select rev.::parent.title, rev.::parent.id, sum(rev.text.bytes)
> > from mediawiki.page.revision as rev
> > where rev.timestamp.between(?, ?)
> > group by rev.::parent;
> >
> > If I were trying to make it more SQL'y, I'd probably go with something like:
> >
> > select
> >  pageTitle,
> >  pageId,
> >  sum(rev.bytes) as totalChanges
> > from (
> >  select
> >    mediawiki.page.title as pageTitle,
> >    mediawiki.page.id as pageId,
> >    flatten(mediawiki.page.revision) as rev
> >  from mediawiki
> >  where rev.timestamp between ? and ?
> > ) as recordPerRevision
> > group by pageTitle, pageId
> > order by totalChanges desc
> >
> > A cleaner alternative would be providing the more complicated WITHIN
> > syntax as BigQuery does:
> >
> > select
> >    mediawiki.page.title as pageTitle,
> >    mediawiki.page.id as pageId,
> >    sum(mediawiki.page.revision.bytes) as totalChanges within RECORD
> >  from mediawiki
> >  where rev.timestamp between ? and ?
> >
> >
> > Or extending the SQL2003 windowing functions so that partitioning
> > within a single record is possible, and then making the aggregating
> > functions use relative references.
> >
> > select
> >    mediawiki.page.title as pageTitle,
> >    mediawiki.page.id as pageId,
> >    sum(bytes) as totalChanges OVER(PARTITION BY mediawiki.page.revision)
> >  where rev.timestamp between ? and ?
> >
> > Or, taking the simple approach, providing a specialized 'scalar' function
> > ARRAY_SUM(array_node_to_iterate_over,
> > expression_to_evaluate_on_each_iterated_value):
> >
> > select
> >    mediawiki.page.title as pageTitle,
> >    mediawiki.page.id as pageId,
> >    ARRAY_SUM(mediawiki.page.revision, bytes) as totalChanges
> >  where rev.timestamp between ? and ?
> >
> >
> >>
> >> I also understood Drill was more of an analytical platform. So my
> >> understanding is that we want to access hierarchical data, but we do not
> >> want to generate any. Besides, trying to run reports, charts or tables
> >> (typical client applications) on hierarchical data is a mess, as the
> >> toolset simply doesn't support it. For this reason, I would focus on
> >> generating flat results for the time being.
> >>
> >>
> > I think this is a really great point. It made me question some of the
> > assumptions I had been operating on.  That being said, I'd like to hold
> > off on trimming that tree entirely for the time being.  I'm concerned
> > doing so would substantially limit the effectiveness of ever using nested
> > datasets with it.
> > For example, if I do select * from a nested dataset, I really want to see
> > hierarchical data returned.  In the case of building up a single query on
> > a number of subqueries, I can see many useful situations where the
> > intermediate queries still maintain hierarchical datasets, even if the
> > final goal output might be a flat data structure for analytical tool use.
> >
> >
> >
> >> If desired, I can start writing an ANTLR grammar for the stuff I am
> >> working on, to make the output more robust. I had a look at the SQL
> >> parser you guys mentioned, but I don't think this would work on my kind
> >> of queries, as it drastically expands SQL 2003. All we want to do is map
> >> the AST to your logical plan, right? I think this can be done quite
> >> easily just using ANTLR and some Java classes.
> >>
> >
> > If you want to build a simple query language that generates logical
> > plans, that would be interesting.  Given my rewrites, are you still
> > skeptical of minimally extending SQL 2003?
> >
> > Jacques
> >
> >
> >>
> >> Stefan
> >>
> >> On 20.01.2013, at 00:56, Jacques Nadeau <jacques.drill@gmail.com> wrote:
> >>
> >>> Many of these haven't been finalized since we're still working on code.
> >>> That being said, let me share what my thoughts have been to date.
> >>>
> >>>> SQL row maps to a Drill record?
> >>> Correct
> >>>
> >>>> And Drill would not have a flat sibling structure of nodes, a.k.a.
> >>>> columns, but hierarchical nodes?
> >>> Correct.  My general thinking is that a record is a DataValue.
> >>> A DataValue can be one of three major types: a map (string:DataValue),
> >>> an ordered list (DataValues[]), or a scalar DataValue.  Most commonly,
> >>> the first DataValue in a record would be a map.  In the case of SQL/flat
> >>> data (e.g. CSV), this map would only contain scalar values.
> >>>
> >>>> Will Drill access the contents of a record in a stream or document
> >>>> manner? How large may a record be?
> >>> For the first version of Drill, I was thinking that a record must fit
> >>> entirely in memory.  Functions can interact with an entire record as
> >>> they choose.
> >>>
> >>>> Can I use XPath-like functions to access nodes?
> >>> Generally, we hope so.  'Like' being the operative word here.  The path
> >>> expressions that we're thinking of using are substantially simpler than
> >>> the expressiveness of XPath.  Ultimately, I could see people creating a
> >>> parser which takes in XQuery queries and converts them to Drill logical
> >>> plans.  That being said, our goal is more for analytical queries than
> >>> document transformations.
> >>>
> >>>> All of the Google BigQuery Cook Book examples seem to generate flat
> >>>> output, is this a limitation?
> >>> In Drill, we don't plan to limit to flat output.  For v1, we're looking
> >>> at supporting hierarchical expressions in SQL 'as' aliases.  We're also
> >>> looking at supporting selections at any level of the hierarchy, not just
> >>> the leaf level.  We then combine these with a concept of collision
> >>> behavior control so that you can control how to merge multiple nested
> >>> output values into a single output tree.  These will allow one to build
> >>> a nested output object.  These are preliminary thoughts.  We need to
> >>> write more and discuss more.
> >>>
> >>> One thing to remember is that one of Drill's goals is to be flexible.
> >>> Ultimately, different query languages may support different subsets of
> >>> operations and no one query language may include all operators.
> >>>
> >>> Hope that makes sense.
> >>>
> >>> Jacques
> >>>
> >>> On Sat, Jan 19, 2013 at 3:11 PM, Siprell, Stefan
> >>> <stefan.siprell@exxeta.de> wrote:
> >>>
> >>>> Aaaah, studying the BigQuery docs helped. I may assume that a SQL row
> >>>> maps to a Drill record? And Drill would not have a flat sibling
> >>>> structure of nodes, a.k.a. columns, but hierarchical nodes?  All of the
> >>>> Google BigQuery Cook Book examples seem to generate flat output, is
> >>>> this a limitation? If not, how would I generate my hierarchical output
> >>>> model without using a Groovy builder or XQuery :-)
> >>>>
> >>>>
> >>>> Stefan
> >>>>
> >>>> Sent from my iPad
> >>>>
> >>>> On 20.01.2013 at 00:01, "Jacques Nadeau" <jacques.drill@gmail.com> wrote:
> >>>>
> >>>>> Fair enough.  Starting with BigQuery syntax or SQL 2003 and flat data
> >>>>> structures will work fine.  I'll try to write something meaningful up
> >>>>> about SQL and nested data structures.
> >>>>>
> >>>>> Jacques
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Sat, Jan 19, 2013 at 2:54 PM, Siprell, Stefan
> >>>>> <stefan.siprell@exxeta.de> wrote:
> >>>>>
> >>>>>> Should I not just use this here as a reference?
> >>>>>>
> >>>>>> https://developers.google.com/bigquery/docs/query-reference
> >>>>>>
> >>>>>> I am a bit stumped, to be honest. I am trying to think how to use
> >>>>>> SQL efficiently on nested data structures.
> >>>>>>
> >>>>>> Sent from my iPad
> >>>>>>
> >>>>>> On 19.01.2013 at 19:51, "Jacques Nadeau" <jacques.drill@gmail.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> * I drew a UML diagram. I saw that there is some Gliffy support in
> >>>>>> Confluence, but the free account is pretty much useless. I used
> >>>>>> OmniGraffle to draw the diagram, but this is payware on the Mac - is
> >>>>>> there some usable freeware alternative? Don't mention Tigris :-)
> >>>>>>
> >>>>>>
> >>>>>> I don't have any suggestions on this.
> >>>>>>
> >>>>>>
> >>>>>> * I have some ideas on the queries, but I am not sure how I should
> >>>>>> specify them. Should I use pseudo-SQL? Prose? I saw the syntax
> >>>>>> document on the server; is it mature enough that I can attempt to use
> >>>>>> its syntax? Is there a BNF or, better, an ANTLR grammar I can use to
> >>>>>> check my syntax? Should I draw one up while I am at it?
> >>>>>>
> >>>>>>
> >>>>>> I suggest you target SQL2003 (including subqueries).  We're looking
> >>>>>> at how to use Optiq's SQL parser for Drill.  Our goal is to stay as
> >>>>>> close as possible to that spec but add the following extensions:
> >>>>>> - Add a flatten operator similar to BigQuery syntax
> >>>>>> - Support use of selection and output identifiers using
> >>>>>> dotted/bracketed notation.  E.g. "select person.children[0].age as
> >>>>>> output.profile.firstChildAge"
> >>>>>> - Support new functions that can accept nested values including
> >>>>>> collections and maps.  For example "select ARRAY_LENGTH(person.children)".
> >>>>>>
> >>>>>> Once you have some SQL examples, the next goal would be to manually
> >>>>>> translate those into Logical Plan syntax.  This syntax is still
> >>>>>> maturing, so I'd take it to the SQL stage first.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Stefan
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 19.01.2013, at 02:05, Jacques Nadeau <jacques.drill@gmail.com> wrote:
> >>>>>>
> >>>>>> The wiki is up.  Michael and Stefan, it would be great if you started
> >>>>>> putting your use case thoughts there.
> >>>>>>
> >>>>>> Jacques
> >>>>>>
> >>>>>> On Sun, Jan 13, 2013 at 3:31 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >>>>>>
> >>>>>> Ahh... yes.  That wiki.  I will ping infra again.
> >>>>>>
> >>>>>> (I was attaching your comment to the Wikipedia use case and had
> >>>>>> confused myself)
> >>>>>>
> >>>>>> On Sun, Jan 13, 2013 at 2:53 PM, Michael Hausenblas
> >>>>>> <michael.hausenblas@gmail.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>> What do you need from me?
> >>>>>>
> >>>>>> Maybe I've overlooked something, in which case I apologize - I was
> >>>>>> wondering if the public Wiki for Drill is available where Stefan, I
> >>>>>> and others can write up the UC and queries.
> >>>>>>
> >>>>>> Cheers,
> >>>>>>            Michael
> >>>>>>
> >>>>>> --
> >>>>>> Michael Hausenblas
> >>>>>> Ireland, Europe
> >>>>>> http://mhausenblas.info/
> >>>>>>
> >>>>>> On 13 Jan 2013, at 14:20, Ted Dunning <ted.dunning@gmail.com> wrote:
> >>>>>>
> >>>>>> What do you need from me?
> >>>>>>
> >>>>>>
> >>>>>> On Sun, Jan 13, 2013 at 11:06 AM, Michael Hausenblas
> >>>>>> <michael.hausenblas@gmail.com> wrote:
> >>>>>>
> >>>>>> as soon as we hear back from Ted re the Wiki we work there.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>
> >>
>
>
