drill-dev mailing list archives

From Jacques Nadeau <jacques.dr...@gmail.com>
Subject Re: Updates to logical operators
Date Thu, 29 Nov 2012 19:10:13 GMT
A few quick additions for 1 and 2:

As Ted mentioned, I've been doing work on a GitHub fork under jacques-n.  I'm
working on logical plan operator and expression parsing and some example
plans.  I've updated the docs with Ted's addition of the sequence syntactic
operator and the concept of references and have a working demo parser on my
GH fork.  I'm in the process of more closely mapping the operators across
the docs and the code and am also examining Optiq for methods
of cross-pollination and/or integration.


On Thu, Nov 29, 2012 at 8:48 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> I have been a bit flattened lately and thus am late on reporting some
> experiments I have done related to Drill.  Here are my conclusions.
> *1.*
> My first experiment was to compare a JSON formatted syntax for the logical
> plan similar to what Jacques is proposing.  I found that this was much more
> convenient for expressing plans, largely because of the ability to use
> syntactic nesting to indicate implicit data flow connections.  This avoids
> most of the need for explicit linking of the output of one operator to the
> input of the next and removes an enormous amount of clutter.
> There are still instances where explicit linking is required, but these
> actually are improved by JSON, if only because the explicit links are
> easier to follow when all of the trivial linkages are hidden.
> The current syntax proposal by Jacques does not provide for implicit
> linking, but I don't think that is a major issue since it is a fairly minor
> syntactic issue.
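
To make the implicit-linking point in item 1 concrete, here is a minimal sketch (not code from either experiment; Node, Linker and the operator names are hypothetical). It treats nesting as the only wiring information and recovers the explicit links that a flat syntax would have to spell out by hand.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: nesting implies data flow, so links can be derived.
final class LinkingDemo {
    public static void main(String[] args) {
        // project(filter(scan)): no link is written anywhere in the plan.
        Node plan = new Node("project", new Node("filter", new Node("scan")));
        for (String link : Linker.explicitLinks(plan)) {
            System.out.println(link);
        }
        // prints:
        // scan#0 -> filter#1
        // filter#1 -> project#2
    }
}

// A plan node whose inputs are simply nested inside it.
final class Node {
    final String op;
    final List<Node> inputs;

    Node(String op, Node... inputs) {
        this.op = op;
        this.inputs = List.of(inputs);
    }
}

final class Linker {
    // Returns the "producer -> consumer" links implied by the nesting.
    static List<String> explicitLinks(Node root) {
        List<String> links = new ArrayList<>();
        walk(root, new int[] {0}, links);
        return links;
    }

    // Post-order walk: inputs are labelled before the operator that consumes them.
    private static String walk(Node node, int[] nextId, List<String> links) {
        List<String> inputLabels = new ArrayList<>();
        for (Node input : node.inputs) {
            inputLabels.add(walk(input, nextId, links));
        }
        String label = node.op + "#" + nextId[0]++;
        for (String inputLabel : inputLabels) {
            links.add(inputLabel + " -> " + label);
        }
        return label;
    }
}
```

A plan that is a DAG rather than a tree would still need some explicit references, which matches the remark that explicit linking is sometimes required.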
> *2.*
> My second experiment had to do with a comparison of a style of logical plan
> processing that I used years ago for building machine learning data-flows
> to the implementation style that Jacques has been using based on Jackson.
>  This was an interesting experience for me because I was pretty far behind
> on what Jackson can do.
> The older style was fun because it allowed pretty flexible interpretation
> of a flow and it decentralized the implementation of specific
> interpretations so that you could add new stuff easily.  It also was very
> concrete with little magic so that it could be adopted easily.  The result
> has been a viral transmission of this style through a number of financial
> modeling shops especially to do with fraud detection.  The crux of the
> method is that there is a one-to-one correspondence between operators and
> implementation classes and each of the implementation classes has a
> constructor that accepts the specification of the operator (as a JSON
> object for this case).  That constructor is responsible for dealing with
> the specification and can call a utility class to instantiate any operators
> in that specification.
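
A rough sketch of the older style described above, under stated assumptions: a plain Map stands in for the JSON object, and Operator, Operators, Scan and Filter are hypothetical names, not classes from either code base. Each implementation class has a constructor that takes the operator's specification, and a small utility class instantiates any nested operators.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// All names are hypothetical; a Map stands in for a parsed JSON object.
final class OperatorStyleDemo {
    public static void main(String[] args) {
        Map<String, Object> plan = Map.of(
            "op", "filter",
            "expr", "x > 3",
            "input", Map.of("op", "scan", "source", "events"));
        System.out.println(Operators.build(plan).describe());
        // prints: filter(x > 3) <- scan(events)
    }
}

interface Operator {
    String describe();
}

// Utility class: one registry entry per operator, pointing at its constructor.
final class Operators {
    private static final Map<String, Function<Map<String, Object>, Operator>> REGISTRY =
        new HashMap<>();

    static {
        REGISTRY.put("scan", Scan::new);
        REGISTRY.put("filter", Filter::new);
    }

    static Operator build(Map<String, Object> spec) {
        Function<Map<String, Object>, Operator> ctor = REGISTRY.get(spec.get("op"));
        if (ctor == null) {
            throw new IllegalArgumentException("unknown operator: " + spec.get("op"));
        }
        return ctor.apply(spec);
    }
}

final class Scan implements Operator {
    private final String source;

    // The constructor is responsible for interpreting its own specification.
    Scan(Map<String, Object> spec) {
        this.source = (String) spec.get("source");
    }

    public String describe() {
        return "scan(" + source + ")";
    }
}

final class Filter implements Operator {
    private final String expr;
    private final Operator input;

    @SuppressWarnings("unchecked")
    Filter(Map<String, Object> spec) {
        this.expr = (String) spec.get("expr");
        // Nested operators are instantiated through the utility class.
        this.input = Operators.build((Map<String, Object>) spec.get("input"));
    }

    public String describe() {
        return "filter(" + expr + ") <- " + input.describe();
    }
}
```

Adding an operator in this style means writing one class and one registry entry, which is the decentralization mentioned above.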
> The alternative is to use Jackson to directly deserialize the JSON plan to
> objects.  Jackson provides type resolvers so that you can glue whatever
> type of object you might like onto the operators in the plan.
> My conclusion from this experiment is that in spite of the conceptual
> simplicity of the older approach, the Jackson approach really is better and
> simpler.  There will necessarily be some need for packaging that approach
> to make it as easy to explain as my older approach, but the consistency and
> flexibility really will be better that way.
> *3.*
> A third experiment had to do with generating code on the fly for evaluating
> expressions efficiently.  The problem is that we have expressions in terms
> of built-in functions and operations that reference data records.  These
> expressions need to be evaluated very efficiently in order to maintain the
> speed of the eventual execution engine.  Impala does this using LLVM to
> generate native code, for instance, and many modern databases do similar
> things.
> Ryan Rawson pointed me at Janino, saying that they have been using it very
> happily at Drawn to Scale.
> My experiments involved parsing expressions from their original format and
> then unparsing the expressions back into Java which I then compiled on the
> fly using Janino for evaluation.
> The results were very, very good.  The Janino API meant that the code
> required to compile an expression is tiny and to the point and evaluation
> of the expression is extremely easy.  The fact that the code goes directly
> to JVM byte codes means that all of the power of the JVM's JIT will be used
> to make the evaluations go fast.  The only slowdown in the code that I used
> was the construction of a map to inject values into the expression.  In a
> real execution engine, I expect that this map will have one value (the
> array holding the incoming record) and thus will be able to last for the
> life of the operator.  The expression will only contain references to this
> array so data injection will not require any extra object creation.
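
The "unparse back into Java" step in item 3 can be sketched roughly as follows. This is an illustration only, not the code from the experiment: Expr, FieldRef, NumberLiteral and BinaryOp are hypothetical names, and the compile step is left out. The generated source refers only to a single array named record, so the string could then be handed to an on-the-fly compiler such as Janino with that array as its one parameter.

```java
import java.util.List;

// Hypothetical expression tree that unparses itself into Java source text.
final class UnparseDemo {
    public static void main(String[] args) {
        List<String> schema = List.of("price", "qty");
        Expr e = new BinaryOp("*", new FieldRef("price"),
            new BinaryOp("+", new FieldRef("qty"), new NumberLiteral(1)));
        System.out.println(e.toJava(schema));
        // prints: (record[0] * (record[1] + 1.0))
    }
}

interface Expr {
    // Emits Java source that reads its inputs from an array called "record".
    String toJava(List<String> schema);
}

final class NumberLiteral implements Expr {
    private final double value;

    NumberLiteral(double value) {
        this.value = value;
    }

    public String toJava(List<String> schema) {
        return Double.toString(value);
    }
}

final class FieldRef implements Expr {
    private final String name;

    FieldRef(String name) {
        this.name = name;
    }

    public String toJava(List<String> schema) {
        // Field names are resolved to array positions once, at unparse time.
        int index = schema.indexOf(name);
        if (index < 0) {
            throw new IllegalArgumentException("unknown field: " + name);
        }
        return "record[" + index + "]";
    }
}

final class BinaryOp implements Expr {
    private final String op;
    private final Expr left;
    private final Expr right;

    BinaryOp(String op, Expr left, Expr right) {
        this.op = op;
        this.left = left;
        this.right = right;
    }

    public String toJava(List<String> schema) {
        return "(" + left.toJava(schema) + " " + op + " " + right.toJava(schema) + ")";
    }
}
```

Because field names become array indexes before compilation, evaluating the compiled expression per record needs no map lookups or extra object creation, which is the property the last paragraph above is after.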
> On Wed, Nov 28, 2012 at 6:26 PM, Jacques Nadeau <jacques.drill@gmail.com> wrote:
> > I've updated the logical plan syntax and operator list.  I've updated for
> > JSON operator definitions.  I've also moved away from the abstract notion
> > of explosion/implosion and nested operators because of the created
> > interdependencies between operators.  Instead, the logical plan provides
> > more clearly defined operators.  Theoretically, this could suggest a
> > substantial data redundancy between individual operators.  That being said,
> > I think this isn't an issue for two reasons: 1) there is nothing
> > stopping the creation of a physical operator and associated planner rule
> > that converts multiple logical operators into a single physical operator
> > that does explode...implode-like stuff and 2) "quote marks" encoding could
> > be a key principle to our wire encoding scheme that minimizes pushing
> > around flattened duplicate data.  I'd love feedback on the set of logical
> > operators, especially those focused on aggregate/window frame/cogroup.  You
> > can check out the doc on Google:
> >
> > https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit
> >
> > Thanks,
> > Jacques
> >
