drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hyunsik Choi <hyun...@apache.org>
Subject Re: DrQL grammar and parser
Date Tue, 28 Aug 2012 06:13:04 GMT
Dear Camuel,

I have a different view to you. Dremel's query language has a new
projection model specialized to a nested data model, which is very useful
to newly emerged applications. A Dremel query does not represents
any physical execution.

Besides, I don't agree that a query language itself is a logical plan. In
some aspect, this is right. However, even though it is possible, this way
makes implementation complicated. Consider the projection push-down
optimization. Since Drill uses columnar storage, optimizing projection is
necessary. In most cases, operators (e.g., scan, selection, join, and
aggregation) of a query have different input and output schemas. Drill has
to compute optimal in/out schemas for each operators. In addition,
expressions (e.g., algebraic expression, general function, and aggregation
function) of a projection list have to be evaluated in relevant operators.
it is also computed by the projection push down optimization. It is hard to
imagine how we optimize the projection by using only an AST. Finally, you
would need some intermediate form of the query or additional data
structures that act as a part of a logical plan.

P.S. we need to note that this discussion was started by the question of
Ted Dunning, which assumes some people may want to additional query
languages besides DrQL. However, our discussion is far away from the start
of this discussion.

Best regards,
--
Hyunsik Choi

On Tue, Aug 28, 2012 at 12:18 PM, Camuel Gilyadov <camuel@gmail.com> wrote:

> Well, we need first to define the scope of Drill than we could talk more
> intelligently.
>
> If we are talking strictly about Dremel, then the whole point about
> Dremel's nested data is physical execution model, nesting is in fact an
> index for the join or a kind of pre-join. Dremel queries dictates physical
> execution without leaving much choice. Moreover, all query plans are full
> table scans with the same pattern: PROJECT=>SELECT=>AGGREGATE
>
> There is no other tables and no indexes to enable choice. After the paper
> was published they added 'having' and 'small table joins' features but this
> still is minor.
>
> I just not seeing anything that could be "logical plan". May be it will be
> enough to make one query language rich enough that everything could be
> directly accessed in that language and then all other tools would just
> generate queries in that language? DrQL for example could be such language.
> The this language itself would be a "logical plan". Could this work? As an
> advantage, ordinary users would also have access to it.
>
> In OpenDremel we do have query-plans (one can check OpenDremel slide deck
> for more info). In OpenDremel the query plan is DAG of scalar operators,
> could be even optimized using math rules (like constants pre-calculation)
> and predicate-pushdown rule. However, I doubt this plan could be called
> 'logical' and I am undecided whether it should be a stable abstraction.
>
> -----------------
>
> Bottom line, Dremel is not about logical models, high-level optimization
> and full-scan avoidance, Quite the opposite: Dremel is all about brute
> force full scan of denormalized single table and all the energy is devoted
> to make that brute force scan efficient and fast.
>
> On Tue, Aug 28, 2012 at 4:26 AM, Hyunsik Choi <hyunsik@apache.org> wrote:
>
> > Hi Camuel,
> >
> > I think that it may be not an upfront formalization if we focus on a
> > logical plan model. Since a logical plan model is generally equivalent to
> > an algebra, we don't need to concern physical execution methods (i.e.,
> > iteration models over nested data set).
> >
> > --
> > Hyunsik Choi
> >
> > On Mon, Aug 27, 2012 at 7:07 PM, Camuel Gilyadov <camuel@gmail.com>
> wrote:
> >
> > > Of course optimizer must work with some intermediate form of query,
> > which I
> > > think could be an object graph expressed for now by using ordinary
> > > programming language objects without much fuss (like Java or Python).
> > > However, I think its upfront formalization is going to develop into
> > > never-ending story and mainly because iteration over nested datasets is
> > far
> > > from being clear, different nested dataset languages are not only
> differs
> > > in syntax but also in the iteration model itself. I think Rob
> > > Grzywinski started this discussion in separate thread already...
> > >
> > > Also regarding optimizer and DAG, as there is not much index/joins
> action
> > > going on what we have now is mostly a chain of transformations
> > > which of-course is also formally a DAG :) but thinking about it as a
> > chain
> > > will simplify things at first, there is no much optimizations you can
> do
> > > with it. If you go one level down and consider scalar operations then
> you
> > > get more elaborate DAG of course.
> > > -----------------
> > > Let's separate issues:
> > >
> > > 1. Query Plan is distributed workload and that must be formalized and I
> > > think no one suggests otherwise. Also no one suggest other model than
> DAG
> > > except me. I suggest unrestricted graph just to keep backend useful for
> > > other stuff, for the purposes of DrQL DAG is more than adequate in my
> > > opinion. However, this is a DAG of identical nodes, and it follows
> > physical
> > > data partitioning. Let's label it physical DAG in order not to confuse
> > with
> > > logical query plan.
> > >
> > > 2. Open and somewhat confused issue is what actually runs a single node
> > of
> > > the above mentioned physical DAG? Is this a formalized query plan or
> just
> > > arbitrary code?
> > >
> > > 3. Query plan formalization: main obstacle here is that the model of
> > > iterating nested datasets are far from clear. Particularly nor Dremel
> > paper
> > > neither BigQuery reference describe well the behavior of querying
> nested
> > > datasets with all different subcases. There many other languages to
> query
> > > nested data but the iteration model varies significantly between them.
> > For
> > > formalization we miss one another academic paper which
> > > would rigorously define canonical high-performance iteration model for
> > > nested datasets.
> > >
> > > 4. Another complicating factor is columnar optimization. Drill is going
> > to
> > > be nested-columnar engine and as such part of query plan must be
> > columnar.
> > > So full set of column-oriented and record-oriented primitives are
> needed
> > > record-construction primitives.
> > >
> > >
> > > On Mon, Aug 27, 2012 at 9:29 AM, Hyunsik Choi <hyunsik.choi@gmail.com
> > > >wrote:
> > >
> > > > Hi David,
> > > >
> > > > I agree with some of your claims. I also think that now DrQL may be
> > > enough
> > > > to Drill project.
> > > >
> > > > Even if we don't support various query languages, I think complex
> query
> > > > languages (like SQL and DrQL) should have an logical form in order to
> > > deal
> > > > a given query without considering actual physical information. It
> > > provides
> > > > an easy way to modify the query to be more optimized one (e.g.,
> pushing
> > > > down projection, selection, and finding the best operator order)
> while
> > > the
> > > > optimized one is logically equivalent to the original query.
> > > >
> > > > Also, It would not hurt performance. For example, OLTP that
> processes a
> > > > query within a few milliseconds already employs such a logical plan
> > > > model. Although a logical plan is generic, it is not hugely different
> > to
> > > > existing logical plan models.
> > > >
> > > > --
> > > > Hyunsik Choi
> > > >
> > > > On Mon, Aug 27, 2012 at 2:34 PM, David Gruzman <
> david@bigdatacraft.com
> > > > >wrote:
> > > >
> > > > > Hi,
> > > > > Dremel is high performance system. I think building something
> generic
> > > > > "inter-languages" will hurt performance.
> > > > > Having generic executor service we can add several different
> > paradigms
> > > of
> > > > > the local computation (and even not local). But I think
> > > > > SQL like query language should be done in most efficient way.
> > > > > David
> > > > >
> > > > > On Mon, Aug 27, 2012 at 3:20 AM, Hyunsik Choi <hyunsik@apache.org>
> > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > How about having a generic logical plan described as a DAG,
where
> > > each
> > > > > > vertex indicates a logical operator including various annotations
> > and
> > > > > each
> > > > > > edge represents a data flow. A DAG has much expressive power.
> Many
> > > > > > literatures have shown that most logical plans of various data
> > > > > manipulation
> > > > > > languages can be described as such a DAG.
> > > > > >
> > > > > > Additional languages have different ASTs, and they can be
> > transformed
> > > > > into
> > > > > > the generic logical plan. In this case, we can reuse logical
> plan,
> > > > > logical
> > > > > > plan optimization, and physical execution plan. Besides, Drill
> may
> > > > > consider
> > > > > >  a global plan that represents the distributed execution plan.
> > Since
> > > > the
> > > > > > global plan generally depends on the logical plan, we can also
> > reuse
> > > > all
> > > > > > code related to the global plan.
> > > > > >
> > > > > > --
> > > > > > Hyunsik Choi
> > > > > >
> > > > > >
> > > > > > On Mon, Aug 27, 2012 at 6:22 AM, Ted Dunning <
> > ted.dunning@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Camuel,
> > > > > > >
> > > > > > > Do you have a grammar test suite that demonstrates the
range of
> > > > > > > expressions?
> > > > > > >
> > > > > > > Also, I believe that some have a goal to use additional
> languages
> > > > > besides
> > > > > > > SQL like languages.  A limited version of pig, for instance,
> > would
> > > be
> > > > > > very
> > > > > > > interesting.  To do this, it will be important to have
a
> logical
> > > plan
> > > > > > > structure that is common for different syntaxes and is
not
> > limited
> > > to
> > > > > the
> > > > > > > idiosyncracies of any particular syntax.
> > > > > > >
> > > > > > > How do you think that should be handled?  Do you have an
idea
> > for a
> > > > > > logical
> > > > > > > plan structure?
> > > > > > >
> > > > > > > On Sun, Aug 26, 2012 at 4:11 PM, Camuel Gilyadov <
> > camuel@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > I've written and attached ANTLR grammar for DrQL which
I
> assume
> > > is
> > > > > same
> > > > > > > as
> > > > > > > > BigQuery language described in Query Reference on
BigQuery
> > > website.
> > > > > > This
> > > > > > > > grammar includes AST production rules.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message