I've been pulling together a reference logical plan interpreter. I'm
working with Ted to get it into the Drill sandbox. For now, you can find
it in my repo at https://github.com/jacques-n/incubator-drill (prototype
branch).

The goals of the reference interpreter are:
- Provide a simple way to run a Logical Plan against some sample data
and get back the expected result.
- Allow work to start on the parsers while we scale up the performance
and capabilities of the execution engine and optimizer.
- Allow evaluation work on particular technical approaches, such as
exploring the impact of hierarchical and schema-less data on query
evaluation.

These goals do not include performance, memory handling, or efficiency.
Currently, the interpreter is a single-node, single-threaded process. This
will change shortly so that it can also run as a clustered process.

The entry point is inside the /sandbox/prototype/exec/ref module:
org.apache.drill.exec.ref.ReferenceInterpreter.main(). The example program
utilizes two resources, simple-plan.json and donuts.json, and outputs data
to /opt/data/out.json.
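
If you want to kick it off from your own code rather than an IDE, a tiny
wrapper like the one below is enough. The argument handling shown here is
only illustrative; check ReferenceInterpreter for what main() actually
expects (the bundled example may simply pick up simple-plan.json and
donuts.json from the classpath on its own).

    // Illustrative wrapper only: the arguments ReferenceInterpreter.main()
    // expects may differ from what is shown here.
    public class RunReferencePlan {
      public static void main(String[] args) throws Exception {
        org.apache.drill.exec.ref.ReferenceInterpreter.main(
            new String[] {"/simple-plan.json"});
      }
    }
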
Some of the things that 'work':
- Read/write basic json.
- ROPs (reference operators): Filter, Transform, Group, Aggregate
(simple), Order, Union.
- Example aggregate and basic functions including sum, count, multiply,
add, compare, equals.

Basic glossary/concepts (we'll get this on the wiki/javadocs):
- LOP: Logical Operator. An implementation-agnostic data flow operator
utilized by the Logical Plan.
- ROP: Reference Operator. A reference operator implementation that
pairs with a LOP.
- FunctionDefinition: A definition of a particular function. Describes
a set of aliases, an allowable set of input arguments, and an interface
that will attempt to determine the output type.
- BasicEvaluator: An implementation of a particular non-aggregate
expression. Receives a record pointer at creation time. Returns a
DataValue. (Sketched below, after this list.)
- AggregateEvaluator: An implementation of a particular aggregating
function. Is provided a record pointer at creation time. Expects regular
calls to addRecord() followed by a call to eval(), which provides the
aggregate value. (Also sketched below.)
- DataValue: A pointer to a particular data value. Implementation
classes include things like ScalarLong, ScalarBytes, SimpleMapValue and
SimpleArrayValue.
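
To make the evaluator concepts a little more concrete, here is the rough
shape of a basic (scalar) evaluator and an aggregating evaluator, as
referenced from the list above. Treat this strictly as a sketch: the
record-pointer type, the DataValue handling, and the exact signatures are
stand-ins (only addRecord() and eval() on the aggregate side are real names
from the description above), so the existing evaluators in the repo remain
the authoritative examples.

    // Sketch only: illustrative stand-ins for the BasicEvaluator and
    // AggregateEvaluator patterns, not the real interfaces. A bare long[]
    // stands in for the record pointer and plain longs for DataValues.

    // Basic (non-aggregate) evaluator pattern: bound to a record pointer
    // at creation time, produces a value each time it is evaluated.
    class AddLongsSketch {
      private final long[] record;           // stand-in for the record pointer
      private final int left, right;

      AddLongsSketch(long[] record, int left, int right) {
        this.record = record;
        this.left = left;
        this.right = right;
      }

      long eval() {                          // the real one returns a DataValue (e.g. ScalarLong)
        return record[left] + record[right];
      }
    }

    // Aggregating evaluator pattern: addRecord() is called once per
    // incoming record, then eval() returns the aggregate value.
    class SumSketch {
      private final long[] record;           // stand-in for the record pointer
      private final int field;
      private long total;

      SumSketch(long[] record, int field) {
        this.record = record;
        this.field = field;
      }

      void addRecord() { total += record[field]; }

      long eval() { return total; }
    }

The lifecycle is the important part: an evaluator is constructed against a
record pointer once, then re-evaluated (or fed via addRecord()) as the
upstream ROP advances that pointer from record to record.
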
The standard record iterator used between ROPs is the
org.apache.drill.exec.ref.RecordIterator interface. It is somewhat
inspired by the AttributeSource concepts from within the Lucene project.
(I'm planning to extend these concepts all the way to the individual
DataValues.)
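
From the consumer side, a filter-style ROP is essentially a pull loop over
its upstream iterator. The method names below are made up for the sketch;
the real RecordIterator shares state AttributeSource-style rather than
handing records around, so this only shows the general pull-based flow.

    // Illustrative pull loop only; the real RecordIterator contract
    // differs in its details.
    interface SketchIterator {               // hypothetical stand-in interface
      boolean next();                        // advance to the next record; false at end of stream
    }

    interface BooleanEvalSketch {            // stand-in for a boolean-valued BasicEvaluator
      boolean eval();                        // evaluated against the current record
    }

    class FilterRopSketch implements SketchIterator {
      private final SketchIterator incoming; // the upstream ROP's iterator
      private final BooleanEvalSketch predicate;

      FilterRopSketch(SketchIterator incoming, BooleanEvalSketch predicate) {
        this.incoming = incoming;
        this.predicate = predicate;
      }

      // Downstream ROPs pull from this one exactly as it pulls from upstream.
      public boolean next() {
        while (incoming.next()) {
          if (predicate.eval()) return true; // current record passes the filter
        }
        return false;
      }
    }
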
My next goals are to add tests, finish adding ROPs, add local and remote
exchange nodes (parallelization), add a bunch of documentation, and extract
the execution plan as a separate intermediate representation.

It needs a lot more evaluators to be a true reference interpreter (as well
as the rest of the ROPs). The existing ones can be utilized as prototypes.
Anyone interested in ripping through a bunch of additional evaluators and
associated FunctionDefinitions?