drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Camuel Gilyadov <cam...@gmail.com>
Subject Re: DrQL grammar and parser
Date Tue, 28 Aug 2012 12:03:15 GMT
I think different views are essential for a fruitful brainstorming :)

However, I never suggested optimizing AST and bypassing query-plan
abstraction altogether. I was suggesting:

1. Not formalizing query plan, especially not upfront formalizing it.
2. Not calling query plan - logical query plan.
3. Not making query plan a stable abstraction/interface of the back-end and
then having multiple front-end generating it.
4. Heavily downplaying query-optimization in Dremel as not even remotely as
important as with traditional DBMS.
5. To consider making DrQL - main language of the Drill in which everything
Drill is capable of - could be expressed and then having other frontends
generating directly DrQL. For this DrQL must be extended. Added benefit
would be that user would have access to it also.
6. To consider making backend as generic as possible so for more innovative
frontends it would be possible to bypass main frontend and talk directly to
backend.



On Tue, Aug 28, 2012 at 9:13 AM, Hyunsik Choi <hyunsik@apache.org> wrote:

> Dear Camuel,
>
> I have a different view to you. Dremel's query language has a new
> projection model specialized to a nested data model, which is very useful
> to newly emerged applications. A Dremel query does not represents
> any physical execution.
>
> Besides, I don't agree that a query language itself is a logical plan. In
> some aspect, this is right. However, even though it is possible, this way
> makes implementation complicated. Consider the projection push-down
> optimization. Since Drill uses columnar storage, optimizing projection is
> necessary. In most cases, operators (e.g., scan, selection, join, and
> aggregation) of a query have different input and output schemas. Drill has
> to compute optimal in/out schemas for each operators. In addition,
> expressions (e.g., algebraic expression, general function, and aggregation
> function) of a projection list have to be evaluated in relevant operators.
> it is also computed by the projection push down optimization. It is hard to
> imagine how we optimize the projection by using only an AST. Finally, you
> would need some intermediate form of the query or additional data
> structures that act as a part of a logical plan.
>
> P.S. we need to note that this discussion was started by the question of
> Ted Dunning, which assumes some people may want to additional query
> languages besides DrQL. However, our discussion is far away from the start
> of this discussion.
>
> Best regards,
> --
> Hyunsik Choi
>
> On Tue, Aug 28, 2012 at 12:18 PM, Camuel Gilyadov <camuel@gmail.com>
> wrote:
>
> > Well, we need first to define the scope of Drill than we could talk more
> > intelligently.
> >
> > If we are talking strictly about Dremel, then the whole point about
> > Dremel's nested data is physical execution model, nesting is in fact an
> > index for the join or a kind of pre-join. Dremel queries dictates
> physical
> > execution without leaving much choice. Moreover, all query plans are full
> > table scans with the same pattern: PROJECT=>SELECT=>AGGREGATE
> >
> > There is no other tables and no indexes to enable choice. After the paper
> > was published they added 'having' and 'small table joins' features but
> this
> > still is minor.
> >
> > I just not seeing anything that could be "logical plan". May be it will
> be
> > enough to make one query language rich enough that everything could be
> > directly accessed in that language and then all other tools would just
> > generate queries in that language? DrQL for example could be such
> language.
> > The this language itself would be a "logical plan". Could this work? As
> an
> > advantage, ordinary users would also have access to it.
> >
> > In OpenDremel we do have query-plans (one can check OpenDremel slide deck
> > for more info). In OpenDremel the query plan is DAG of scalar operators,
> > could be even optimized using math rules (like constants pre-calculation)
> > and predicate-pushdown rule. However, I doubt this plan could be called
> > 'logical' and I am undecided whether it should be a stable abstraction.
> >
> > -----------------
> >
> > Bottom line, Dremel is not about logical models, high-level optimization
> > and full-scan avoidance, Quite the opposite: Dremel is all about brute
> > force full scan of denormalized single table and all the energy is
> devoted
> > to make that brute force scan efficient and fast.
> >
> > On Tue, Aug 28, 2012 at 4:26 AM, Hyunsik Choi <hyunsik@apache.org>
> wrote:
> >
> > > Hi Camuel,
> > >
> > > I think that it may be not an upfront formalization if we focus on a
> > > logical plan model. Since a logical plan model is generally equivalent
> to
> > > an algebra, we don't need to concern physical execution methods (i.e.,
> > > iteration models over nested data set).
> > >
> > > --
> > > Hyunsik Choi
> > >
> > > On Mon, Aug 27, 2012 at 7:07 PM, Camuel Gilyadov <camuel@gmail.com>
> > wrote:
> > >
> > > > Of course optimizer must work with some intermediate form of query,
> > > which I
> > > > think could be an object graph expressed for now by using ordinary
> > > > programming language objects without much fuss (like Java or Python).
> > > > However, I think its upfront formalization is going to develop into
> > > > never-ending story and mainly because iteration over nested datasets
> is
> > > far
> > > > from being clear, different nested dataset languages are not only
> > differs
> > > > in syntax but also in the iteration model itself. I think Rob
> > > > Grzywinski started this discussion in separate thread already...
> > > >
> > > > Also regarding optimizer and DAG, as there is not much index/joins
> > action
> > > > going on what we have now is mostly a chain of transformations
> > > > which of-course is also formally a DAG :) but thinking about it as a
> > > chain
> > > > will simplify things at first, there is no much optimizations you can
> > do
> > > > with it. If you go one level down and consider scalar operations then
> > you
> > > > get more elaborate DAG of course.
> > > > -----------------
> > > > Let's separate issues:
> > > >
> > > > 1. Query Plan is distributed workload and that must be formalized
> and I
> > > > think no one suggests otherwise. Also no one suggest other model than
> > DAG
> > > > except me. I suggest unrestricted graph just to keep backend useful
> for
> > > > other stuff, for the purposes of DrQL DAG is more than adequate in my
> > > > opinion. However, this is a DAG of identical nodes, and it follows
> > > physical
> > > > data partitioning. Let's label it physical DAG in order not to
> confuse
> > > with
> > > > logical query plan.
> > > >
> > > > 2. Open and somewhat confused issue is what actually runs a single
> node
> > > of
> > > > the above mentioned physical DAG? Is this a formalized query plan or
> > just
> > > > arbitrary code?
> > > >
> > > > 3. Query plan formalization: main obstacle here is that the model of
> > > > iterating nested datasets are far from clear. Particularly nor Dremel
> > > paper
> > > > neither BigQuery reference describe well the behavior of querying
> > nested
> > > > datasets with all different subcases. There many other languages to
> > query
> > > > nested data but the iteration model varies significantly between
> them.
> > > For
> > > > formalization we miss one another academic paper which
> > > > would rigorously define canonical high-performance iteration model
> for
> > > > nested datasets.
> > > >
> > > > 4. Another complicating factor is columnar optimization. Drill is
> going
> > > to
> > > > be nested-columnar engine and as such part of query plan must be
> > > columnar.
> > > > So full set of column-oriented and record-oriented primitives are
> > needed
> > > > record-construction primitives.
> > > >
> > > >
> > > > On Mon, Aug 27, 2012 at 9:29 AM, Hyunsik Choi <
> hyunsik.choi@gmail.com
> > > > >wrote:
> > > >
> > > > > Hi David,
> > > > >
> > > > > I agree with some of your claims. I also think that now DrQL may
be
> > > > enough
> > > > > to Drill project.
> > > > >
> > > > > Even if we don't support various query languages, I think complex
> > query
> > > > > languages (like SQL and DrQL) should have an logical form in order
> to
> > > > deal
> > > > > a given query without considering actual physical information. It
> > > > provides
> > > > > an easy way to modify the query to be more optimized one (e.g.,
> > pushing
> > > > > down projection, selection, and finding the best operator order)
> > while
> > > > the
> > > > > optimized one is logically equivalent to the original query.
> > > > >
> > > > > Also, It would not hurt performance. For example, OLTP that
> > processes a
> > > > > query within a few milliseconds already employs such a logical plan
> > > > > model. Although a logical plan is generic, it is not hugely
> different
> > > to
> > > > > existing logical plan models.
> > > > >
> > > > > --
> > > > > Hyunsik Choi
> > > > >
> > > > > On Mon, Aug 27, 2012 at 2:34 PM, David Gruzman <
> > david@bigdatacraft.com
> > > > > >wrote:
> > > > >
> > > > > > Hi,
> > > > > > Dremel is high performance system. I think building something
> > generic
> > > > > > "inter-languages" will hurt performance.
> > > > > > Having generic executor service we can add several different
> > > paradigms
> > > > of
> > > > > > the local computation (and even not local). But I think
> > > > > > SQL like query language should be done in most efficient way.
> > > > > > David
> > > > > >
> > > > > > On Mon, Aug 27, 2012 at 3:20 AM, Hyunsik Choi <
> hyunsik@apache.org>
> > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > How about having a generic logical plan described as a
DAG,
> where
> > > > each
> > > > > > > vertex indicates a logical operator including various
> annotations
> > > and
> > > > > > each
> > > > > > > edge represents a data flow. A DAG has much expressive
power.
> > Many
> > > > > > > literatures have shown that most logical plans of various
data
> > > > > > manipulation
> > > > > > > languages can be described as such a DAG.
> > > > > > >
> > > > > > > Additional languages have different ASTs, and they can
be
> > > transformed
> > > > > > into
> > > > > > > the generic logical plan. In this case, we can reuse logical
> > plan,
> > > > > > logical
> > > > > > > plan optimization, and physical execution plan. Besides,
Drill
> > may
> > > > > > consider
> > > > > > >  a global plan that represents the distributed execution
plan.
> > > Since
> > > > > the
> > > > > > > global plan generally depends on the logical plan, we can
also
> > > reuse
> > > > > all
> > > > > > > code related to the global plan.
> > > > > > >
> > > > > > > --
> > > > > > > Hyunsik Choi
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Aug 27, 2012 at 6:22 AM, Ted Dunning <
> > > ted.dunning@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Camuel,
> > > > > > > >
> > > > > > > > Do you have a grammar test suite that demonstrates
the range
> of
> > > > > > > > expressions?
> > > > > > > >
> > > > > > > > Also, I believe that some have a goal to use additional
> > languages
> > > > > > besides
> > > > > > > > SQL like languages.  A limited version of pig, for
instance,
> > > would
> > > > be
> > > > > > > very
> > > > > > > > interesting.  To do this, it will be important to
have a
> > logical
> > > > plan
> > > > > > > > structure that is common for different syntaxes and
is not
> > > limited
> > > > to
> > > > > > the
> > > > > > > > idiosyncracies of any particular syntax.
> > > > > > > >
> > > > > > > > How do you think that should be handled?  Do you have
an idea
> > > for a
> > > > > > > logical
> > > > > > > > plan structure?
> > > > > > > >
> > > > > > > > On Sun, Aug 26, 2012 at 4:11 PM, Camuel Gilyadov <
> > > camuel@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > I've written and attached ANTLR grammar for DrQL
which I
> > assume
> > > > is
> > > > > > same
> > > > > > > > as
> > > > > > > > > BigQuery language described in Query Reference
on BigQuery
> > > > website.
> > > > > > > This
> > > > > > > > > grammar includes AST production rules.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message