drill-dev mailing list archives

From Camuel Gilyadov <cam...@gmail.com>
Subject Re: DrQL grammar and parser
Date Mon, 27 Aug 2012 10:07:11 GMT
Of course the optimizer must work with some intermediate form of the query, which I
think could be an object graph expressed, for now, with ordinary
programming-language objects (Java or Python, say) without much fuss.
However, I think formalizing it upfront is going to turn into a
never-ending story, mainly because iteration over nested datasets is far
from clear: the different nested-dataset languages differ not only
in syntax but also in the iteration model itself. I think Rob
Grzywinski already started this discussion in a separate thread...

Also, regarding the optimizer and the DAG: since there is not much index/join action
going on, what we have now is mostly a chain of transformations,
which of course is formally also a DAG :) but thinking about it as a chain
will simplify things at first, since there are not many optimizations you can do
with it. If you go one level down and consider scalar operations, then you
get a more elaborate DAG, of course.
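
A minimal sketch of the "chain of transformations" view, using ordinary
Python objects as the intermediate form. All names here (Step, select,
project, run_chain, chain_as_dag_edges) are hypothetical and only for
illustration; nothing below is Drill or DrQL code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, object]


@dataclass
class Step:
    """One transformation in the chain (hypothetical plan object)."""
    name: str
    apply: Callable[[Iterable[Record]], Iterator[Record]]


def select(pred: Callable[[Record], bool]) -> Step:
    # Keep only the records for which pred is true.
    return Step("select", lambda rows: (r for r in rows if pred(r)))


def project(fields: List[str]) -> Step:
    # Keep only the named fields of each record.
    return Step("project", lambda rows: ({f: r[f] for f in fields} for r in rows))


def run_chain(steps: List[Step], rows: Iterable[Record]) -> List[Record]:
    # Feed the output of each step into the next one, in order.
    for step in steps:
        rows = step.apply(rows)
    return list(rows)


def chain_as_dag_edges(steps: List[Step]) -> List[Tuple[str, str]]:
    # The same chain viewed as a DAG: one edge per consecutive pair of steps.
    return [(a.name, b.name) for a, b in zip(steps, steps[1:])]
```

Viewed this way, a chain of n steps is just a DAG with n - 1 edges and
no branching, which is why there is little for an optimizer to reorder
beyond swapping neighbouring steps.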
-----------------
Let's separate issues:

1. The query plan is a distributed workload and it must be formalized; I
think no one suggests otherwise. Also, no one except me suggests a model other
than a DAG. I suggest an unrestricted graph just to keep the backend useful for
other things; for the purposes of DrQL a DAG is more than adequate, in my
opinion. However, this is a DAG of identical nodes, and it follows the physical
data partitioning. Let's label it the physical DAG so as not to confuse it with
the logical query plan.
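
A rough sketch of what "a DAG of identical nodes that follows the
physical data partitioning" could look like: one worker per partition,
all running the same function, feeding a single merge node. The names
(build_physical_dag, run_physical, "worker[...]", "merge") are
assumptions made for this example, not part of any existing backend.

```python
from typing import Callable, Dict, List, Sequence


def build_physical_dag(partition_ids: Sequence[str]) -> Dict[str, List[str]]:
    """Adjacency list: every per-partition worker points at one merge node."""
    dag: Dict[str, List[str]] = {}
    for pid in partition_ids:
        dag[f"worker[{pid}]"] = ["merge"]
    dag["merge"] = []
    return dag


def run_physical(
    partitions: Dict[str, list],
    node_fn: Callable[[list], object],
    merge_fn: Callable[[list], object],
):
    """Run the same node_fn on every partition, then merge the partial results."""
    partials = [node_fn(rows) for rows in partitions.values()]
    return merge_fn(partials)


if __name__ == "__main__":
    parts = {"p0": [1, 2, 3], "p1": [4, 5]}
    print(build_physical_dag(list(parts)))
    # A distributed COUNT(*): count locally on each partition, sum at the merge node.
    print(run_physical(parts, len, sum))
```

What node_fn actually is (a formalized plan or arbitrary code) is left
open here on purpose.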

2. An open and somewhat confused issue is what actually runs on a single node of
the above-mentioned physical DAG. Is it a formalized query plan or just
arbitrary code?

3. Query plan formalization: the main obstacle here is that the model for
iterating over nested datasets is far from clear. In particular, neither the
Dremel paper nor the BigQuery reference describes well the behavior of querying
nested datasets in all the different subcases. There are many other languages
for querying nested data, but the iteration model varies significantly between
them. For formalization we are missing one more academic paper, one that
would rigorously define a canonical high-performance iteration model for
nested datasets.

4. Another complicating factor is columnar optimization. Drill is going to
be a nested-columnar engine, and as such part of the query plan must be columnar.
So a full set of column-oriented and record-oriented primitives is needed, along
with record-construction primitives.
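
To make the distinction concrete, here is a small sketch of one
column-oriented primitive and one record-construction primitive. It
assumes flat columns of equal length and deliberately ignores nesting
(the Dremel paper handles that with repetition and definition levels);
the function names are hypothetical.

```python
from typing import Callable, Dict, List


def column_filter(column: List[object], pred: Callable[[object], bool]) -> List[int]:
    """Column-oriented primitive: scan one column, return the matching row ids."""
    return [i for i, value in enumerate(column) if pred(value)]


def construct_records(
    columns: Dict[str, List[object]], row_ids: List[int]
) -> List[Dict[str, object]]:
    """Record-construction primitive: assemble records from columns for the given rows."""
    return [{name: col[i] for name, col in columns.items()} for i in row_ids]


if __name__ == "__main__":
    table = {"url": ["a", "b", "c"], "hits": [3, 10, 7]}
    # The columnar part touches only the "hits" column...
    ids = column_filter(table["hits"], lambda v: v > 5)
    # ...and records are built only for the rows that survived the filter.
    print(construct_records(table, ids))
```

The point of the split is that the filter never reads the "url" column,
and records are only materialized for the surviving rows.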


On Mon, Aug 27, 2012 at 9:29 AM, Hyunsik Choi <hyunsik.choi@gmail.com> wrote:

> Hi David,
>
> I agree with some of your claims. I also think that, for now, DrQL may be
> enough for the Drill project.
>
> Even if we don't support various query languages, I think complex query
> languages (like SQL and DrQL) should have a logical form in order to deal
> with a given query without considering actual physical information. It
> provides an easy way to rewrite the query into a more optimized one (e.g.,
> pushing down projection and selection, and finding the best operator order)
> while the optimized one stays logically equivalent to the original query.
>
> Also, it would not hurt performance. For example, OLTP systems that process a
> query within a few milliseconds already employ such a logical plan
> model. Although the logical plan is generic, it is not hugely different from
> existing logical plan models.
>
> --
> Hyunsik Choi
>
> On Mon, Aug 27, 2012 at 2:34 PM, David Gruzman <david@bigdatacraft.com> wrote:
>
> > Hi,
> > Dremel is a high-performance system. I think building something generic and
> > "inter-language" will hurt performance.
> > With a generic executor service we can add several different paradigms of
> > local computation (and even non-local). But I think
> > a SQL-like query language should be implemented in the most efficient way.
> > David
> >
> > On Mon, Aug 27, 2012 at 3:20 AM, Hyunsik Choi <hyunsik@apache.org> wrote:
> >
> > > Hi,
> > >
> > > How about having a generic logical plan described as a DAG, where each
> > > vertex is a logical operator carrying various annotations and each
> > > edge represents a data flow? A DAG has a lot of expressive power. Many
> > > papers have shown that most logical plans of various data manipulation
> > > languages can be described as such a DAG.
> > >
> > > Additional languages have different ASTs, and they can be transformed into
> > > the generic logical plan. In this case, we can reuse the logical plan, the
> > > logical plan optimization, and the physical execution plan. Besides, Drill
> > > may consider a global plan that represents the distributed execution plan.
> > > Since the global plan generally depends on the logical plan, we can also
> > > reuse all code related to the global plan.
> > >
> > > --
> > > Hyunsik Choi
> > >
> > >
> > > On Mon, Aug 27, 2012 at 6:22 AM, Ted Dunning <ted.dunning@gmail.com>
> > > wrote:
> > >
> > > > Camuel,
> > > >
> > > > Do you have a grammar test suite that demonstrates the range of
> > > > expressions?
> > > >
> > > > Also, I believe that some have a goal of using additional languages
> > > > besides SQL-like languages.  A limited version of Pig, for instance, would
> > > > be very interesting.  To do this, it will be important to have a logical
> > > > plan structure that is common to different syntaxes and is not limited to
> > > > the idiosyncrasies of any particular syntax.
> > > >
> > > > How do you think that should be handled?  Do you have an idea for a
> > > > logical plan structure?
> > > >
> > > > On Sun, Aug 26, 2012 at 4:11 PM, Camuel Gilyadov <camuel@gmail.com> wrote:
> > > >
> > > > > I've written and attached an ANTLR grammar for DrQL, which I assume is
> > > > > the same as the BigQuery language described in the Query Reference on
> > > > > the BigQuery website. This grammar includes AST production rules.
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
