drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Camuel Gilyadov <cam...@gmail.com>
Subject Re: DrQL grammar and parser
Date Tue, 28 Aug 2012 03:18:08 GMT
Well, we need first to define the scope of Drill than we could talk more
intelligently.

If we are talking strictly about Dremel, then the whole point about
Dremel's nested data is physical execution model, nesting is in fact an
index for the join or a kind of pre-join. Dremel queries dictates physical
execution without leaving much choice. Moreover, all query plans are full
table scans with the same pattern: PROJECT=>SELECT=>AGGREGATE

There is no other tables and no indexes to enable choice. After the paper
was published they added 'having' and 'small table joins' features but this
still is minor.

I just not seeing anything that could be "logical plan". May be it will be
enough to make one query language rich enough that everything could be
directly accessed in that language and then all other tools would just
generate queries in that language? DrQL for example could be such language.
The this language itself would be a "logical plan". Could this work? As an
advantage, ordinary users would also have access to it.

In OpenDremel we do have query-plans (one can check OpenDremel slide deck
for more info). In OpenDremel the query plan is DAG of scalar operators,
could be even optimized using math rules (like constants pre-calculation)
and predicate-pushdown rule. However, I doubt this plan could be called
'logical' and I am undecided whether it should be a stable abstraction.

-----------------

Bottom line, Dremel is not about logical models, high-level optimization
and full-scan avoidance, Quite the opposite: Dremel is all about brute
force full scan of denormalized single table and all the energy is devoted
to make that brute force scan efficient and fast.

On Tue, Aug 28, 2012 at 4:26 AM, Hyunsik Choi <hyunsik@apache.org> wrote:

> Hi Camuel,
>
> I think that it may be not an upfront formalization if we focus on a
> logical plan model. Since a logical plan model is generally equivalent to
> an algebra, we don't need to concern physical execution methods (i.e.,
> iteration models over nested data set).
>
> --
> Hyunsik Choi
>
> On Mon, Aug 27, 2012 at 7:07 PM, Camuel Gilyadov <camuel@gmail.com> wrote:
>
> > Of course optimizer must work with some intermediate form of query,
> which I
> > think could be an object graph expressed for now by using ordinary
> > programming language objects without much fuss (like Java or Python).
> > However, I think its upfront formalization is going to develop into
> > never-ending story and mainly because iteration over nested datasets is
> far
> > from being clear, different nested dataset languages are not only differs
> > in syntax but also in the iteration model itself. I think Rob
> > Grzywinski started this discussion in separate thread already...
> >
> > Also regarding optimizer and DAG, as there is not much index/joins action
> > going on what we have now is mostly a chain of transformations
> > which of-course is also formally a DAG :) but thinking about it as a
> chain
> > will simplify things at first, there is no much optimizations you can do
> > with it. If you go one level down and consider scalar operations then you
> > get more elaborate DAG of course.
> > -----------------
> > Let's separate issues:
> >
> > 1. Query Plan is distributed workload and that must be formalized and I
> > think no one suggests otherwise. Also no one suggest other model than DAG
> > except me. I suggest unrestricted graph just to keep backend useful for
> > other stuff, for the purposes of DrQL DAG is more than adequate in my
> > opinion. However, this is a DAG of identical nodes, and it follows
> physical
> > data partitioning. Let's label it physical DAG in order not to confuse
> with
> > logical query plan.
> >
> > 2. Open and somewhat confused issue is what actually runs a single node
> of
> > the above mentioned physical DAG? Is this a formalized query plan or just
> > arbitrary code?
> >
> > 3. Query plan formalization: main obstacle here is that the model of
> > iterating nested datasets are far from clear. Particularly nor Dremel
> paper
> > neither BigQuery reference describe well the behavior of querying nested
> > datasets with all different subcases. There many other languages to query
> > nested data but the iteration model varies significantly between them.
> For
> > formalization we miss one another academic paper which
> > would rigorously define canonical high-performance iteration model for
> > nested datasets.
> >
> > 4. Another complicating factor is columnar optimization. Drill is going
> to
> > be nested-columnar engine and as such part of query plan must be
> columnar.
> > So full set of column-oriented and record-oriented primitives are needed
> > record-construction primitives.
> >
> >
> > On Mon, Aug 27, 2012 at 9:29 AM, Hyunsik Choi <hyunsik.choi@gmail.com
> > >wrote:
> >
> > > Hi David,
> > >
> > > I agree with some of your claims. I also think that now DrQL may be
> > enough
> > > to Drill project.
> > >
> > > Even if we don't support various query languages, I think complex query
> > > languages (like SQL and DrQL) should have an logical form in order to
> > deal
> > > a given query without considering actual physical information. It
> > provides
> > > an easy way to modify the query to be more optimized one (e.g., pushing
> > > down projection, selection, and finding the best operator order) while
> > the
> > > optimized one is logically equivalent to the original query.
> > >
> > > Also, It would not hurt performance. For example, OLTP that processes a
> > > query within a few milliseconds already employs such a logical plan
> > > model. Although a logical plan is generic, it is not hugely different
> to
> > > existing logical plan models.
> > >
> > > --
> > > Hyunsik Choi
> > >
> > > On Mon, Aug 27, 2012 at 2:34 PM, David Gruzman <david@bigdatacraft.com
> > > >wrote:
> > >
> > > > Hi,
> > > > Dremel is high performance system. I think building something generic
> > > > "inter-languages" will hurt performance.
> > > > Having generic executor service we can add several different
> paradigms
> > of
> > > > the local computation (and even not local). But I think
> > > > SQL like query language should be done in most efficient way.
> > > > David
> > > >
> > > > On Mon, Aug 27, 2012 at 3:20 AM, Hyunsik Choi <hyunsik@apache.org>
> > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > How about having a generic logical plan described as a DAG, where
> > each
> > > > > vertex indicates a logical operator including various annotations
> and
> > > > each
> > > > > edge represents a data flow. A DAG has much expressive power. Many
> > > > > literatures have shown that most logical plans of various data
> > > > manipulation
> > > > > languages can be described as such a DAG.
> > > > >
> > > > > Additional languages have different ASTs, and they can be
> transformed
> > > > into
> > > > > the generic logical plan. In this case, we can reuse logical plan,
> > > > logical
> > > > > plan optimization, and physical execution plan. Besides, Drill may
> > > > consider
> > > > >  a global plan that represents the distributed execution plan.
> Since
> > > the
> > > > > global plan generally depends on the logical plan, we can also
> reuse
> > > all
> > > > > code related to the global plan.
> > > > >
> > > > > --
> > > > > Hyunsik Choi
> > > > >
> > > > >
> > > > > On Mon, Aug 27, 2012 at 6:22 AM, Ted Dunning <
> ted.dunning@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Camuel,
> > > > > >
> > > > > > Do you have a grammar test suite that demonstrates the range
of
> > > > > > expressions?
> > > > > >
> > > > > > Also, I believe that some have a goal to use additional languages
> > > > besides
> > > > > > SQL like languages.  A limited version of pig, for instance,
> would
> > be
> > > > > very
> > > > > > interesting.  To do this, it will be important to have a logical
> > plan
> > > > > > structure that is common for different syntaxes and is not
> limited
> > to
> > > > the
> > > > > > idiosyncracies of any particular syntax.
> > > > > >
> > > > > > How do you think that should be handled?  Do you have an idea
> for a
> > > > > logical
> > > > > > plan structure?
> > > > > >
> > > > > > On Sun, Aug 26, 2012 at 4:11 PM, Camuel Gilyadov <
> camuel@gmail.com
> > >
> > > > > wrote:
> > > > > >
> > > > > > > I've written and attached ANTLR grammar for DrQL which
I assume
> > is
> > > > same
> > > > > > as
> > > > > > > BigQuery language described in Query Reference on BigQuery
> > website.
> > > > > This
> > > > > > > grammar includes AST production rules.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message