drill-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Updates to logical operators
Date Thu, 29 Nov 2012 16:48:51 GMT
I have been a bit flattened lately and thus am late on reporting some
experiments I have done related to Drill.  Here are my conclusions.

*1.*

My first experiment was to evaluate a JSON-formatted syntax for the logical
plan, similar to what Jacques is proposing.  I found that this was much more
convenient for expressing plans, largely because of the ability to use
syntactic nesting to indicate implicit data flow connections.  This avoids
most of the need for explicit linking of the output of one operator to the
input of the next and removes an enormous amount of clutter.

There are still instances where explicit linking is required, but these
actually are improved by JSON, if only because the explicit links are
easier to follow when all of the trivial linkages are hidden.
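
To make the nesting point concrete, here is a minimal sketch in Java.  The
names PlanNode and LinkExpander are hypothetical and are not part of any
proposal; the sketch only shows that a nested plan already contains the
producer-to-consumer links that a flat syntax would have to spell out by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical operator node. Placing an operator inside `inputs` is the
// implicit link: no operator ids or references are written by hand.
class PlanNode {
    final String op;
    final List<PlanNode> inputs;

    PlanNode(String op, PlanNode... inputs) {
        this.op = op;
        this.inputs = Arrays.asList(inputs);
    }
}

class LinkExpander {
    // Walks the nested plan and returns the explicit "producer -> consumer"
    // edges that the nesting implies.
    static List<String> explicitLinks(PlanNode root) {
        List<String> edges = new ArrayList<>();
        collect(root, edges);
        return edges;
    }

    private static void collect(PlanNode node, List<String> edges) {
        for (PlanNode input : node.inputs) {
            collect(input, edges);                   // upstream edges first
            edges.add(input.op + " -> " + node.op);  // then the link into this node
        }
    }
}
```

For a plan nested as project(filter(scan)), explicitLinks returns the two
edges "scan -> filter" and "filter -> project", none of which had to be
written explicitly.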

The current syntax proposal by Jacques does not provide for implicit
linking, but I don't think that is a major problem since it is a fairly minor
matter of syntax.

*2.*

My second experiment compared a style of logical plan processing that I
used years ago for building machine learning data flows with the
implementation style that Jacques has been using, which is based on Jackson.
This was an interesting experience for me because I was pretty far behind
on what Jackson can do.

The older style was fun because it allowed pretty flexible interpretation
of a flow and it decentralized the implementation of specific
interpretations so that you could add new stuff easily.  It also was very
concrete with little magic so that it could be adopted easily.  As a result,
this style has spread virally through a number of financial modeling
shops, especially those working on fraud detection.  The crux of the
method is that there is a one-to-one correspondence between operators and
implementation classes and each of the implementation classes has a
constructor that accepts the specification of the operator (as a JSON
object for this case).  That constructor is responsible for dealing with
the specification and can call a utility class to instantiate any operators
in that specification.
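
A minimal sketch of that older style, in Java, might look like the following.
Everything here is an assumption made for illustration: the names Operator,
Operators, Scan and Filter, the "op", "source", "expr" and "input" keys, the
use of a plain Map as a stand-in for a parsed JSON object, and the registry
used to look up implementation classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical operator interface.
interface Operator {
    String describe();
}

// Utility class: maps an operator name to the constructor of its one
// implementation class. New operators register themselves here.
final class Operators {
    private static final Map<String, Function<Map<String, Object>, Operator>> REGISTRY =
            new HashMap<>();

    static {
        register("scan", Scan::new);
        register("filter", Filter::new);
    }

    static void register(String name, Function<Map<String, Object>, Operator> ctor) {
        REGISTRY.put(name, ctor);
    }

    // Instantiates the operator described by a specification (a Map standing
    // in for a JSON object).
    @SuppressWarnings("unchecked")
    static Operator create(Object spec) {
        Map<String, Object> map = (Map<String, Object>) spec;
        Function<Map<String, Object>, Operator> ctor = REGISTRY.get(map.get("op"));
        if (ctor == null) {
            throw new IllegalArgumentException("Unknown operator: " + map.get("op"));
        }
        return ctor.apply(map);
    }
}

final class Scan implements Operator {
    private final String source;

    // The constructor alone is responsible for interpreting the specification.
    Scan(Map<String, Object> spec) {
        this.source = (String) spec.get("source");
    }

    public String describe() {
        return "scan(" + source + ")";
    }
}

final class Filter implements Operator {
    private final String expr;
    private final Operator input;

    Filter(Map<String, Object> spec) {
        this.expr = (String) spec.get("expr");
        // Nested operators are instantiated through the utility class.
        this.input = Operators.create(spec.get("input"));
    }

    public String describe() {
        return "filter(" + expr + ", " + input.describe() + ")";
    }
}
```

Adding an operator in this sketch means writing one class with one
constructor and registering it; nothing else needs to change.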

The alternative is to use Jackson to directly deserialize the JSON plan to
objects.  Jackson provides type resolvers so that you can glue whatever
type of object you might like onto the operators in the plan.

My conclusion from this experiment is that, in spite of the conceptual
simplicity of the older approach, the Jackson approach really is better and
simpler.  The Jackson approach will need some packaging to make it as easy
to explain as my older approach, but the consistency and flexibility really
will be better that way.

*3.*

A third experiment had to do with generating code on the fly for evaluating
expressions efficiently.  The problem is that we have expressions in terms
of built-in functions and operations that reference data records.  These
expressions need to be evaluated very efficiently in order to maintain the
speed of the eventual execution engine.  Impala does this using LLVM to
generate native code, for instance, and many modern databases do similar
things.

Ryan Rawson pointed me at Janino, saying that they have been using it very
happily at Drawn to Scale.

My experiments involved parsing expressions from their original format and
then unparsing the expressions back into Java which I then compiled on the
fly using Janino for evaluation.

The results were very, very good.  With the Janino API, the code
required to compile an expression is tiny and to the point, and evaluating
the expression is extremely easy.  Because the code compiles directly
to JVM bytecode, all of the power of the JVM's JIT will be used
to make the evaluations go fast.  The only slowdown in the code that I used
was the construction of a map to inject values into the expression.  In a
real execution engine, I expect that this map will have one value (the
array holding the incoming record) and thus will be able to last for the
life of the operator.  The expression will only contain references to this
array so data injection will not require any extra object creation.
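
The unparsing step can be sketched as follows.  This is an illustration under
stated assumptions, not the code from my experiment: the Expr, Field, Const
and BinaryOp classes are hypothetical, the incoming record is assumed to be an
Object[] named "record" holding numeric fields, and the call that hands the
generated source to a runtime compiler such as Janino is deliberately left out.

```java
// Hypothetical expression tree; a real one would come from parsing the
// expression syntax used in the plan.
interface Expr {
    // Emits Java source text for this node.
    String toJava();
}

final class Field implements Expr {
    private final int index; // position of the field in the record array

    Field(int index) {
        this.index = index;
    }

    public String toJava() {
        // The only data reference is into the shared record array, so no
        // per-row object needs to be created to inject values.
        return "((Number) record[" + index + "]).doubleValue()";
    }
}

final class Const implements Expr {
    private final double value;

    Const(double value) {
        this.value = value;
    }

    public String toJava() {
        return Double.toString(value);
    }
}

final class BinaryOp implements Expr {
    private final String op;
    private final Expr left;
    private final Expr right;

    BinaryOp(String op, Expr left, Expr right) {
        this.op = op;
        this.left = left;
        this.right = right;
    }

    public String toJava() {
        // Fully parenthesized so operator precedence never has to be tracked.
        return "(" + left.toJava() + " " + op + " " + right.toJava() + ")";
    }
}
```

For example, the tree for "field 0 plus 1.5 is greater than field 2" unparses
to

    ((((Number) record[0]).doubleValue() + 1.5) > ((Number) record[2]).doubleValue())

which is an ordinary Java expression over a single parameter, the record
array.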

On Wed, Nov 28, 2012 at 6:26 PM, Jacques Nadeau <jacques.drill@gmail.com> wrote:

> I've updated the logical plan syntax and operator list.  I've updated for
> JSON operator definitions.  I've also moved away from the abstract notion
> of explosion/implosion and nested operators because of the created
> interdependencies between operators.  Instead, the logical plan provides
> more clearly defined operators.  Theoretically, this could suggest a
> substantial data redundancy between individual operators.  That being said,
> I think this isn't an issue because of two reasons: 1) there is nothing
> stopping the creation of a physical operator and associated planner rule
> that converts multiple logical operators into a single physical operator
> that does explode...implode-like stuff and 2) "quote marks" encoding could
> be a key principle to our wire encoding scheme that minimizes pushing
> around flattened duplicate data.  I'd love feedback on the set of logical
> operators, especially those focused on aggregate/window frame/cogroup.  You
> can check out the doc on google:
>
> https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit
>
> Thanks,
> Jacques
>
