drill-dev mailing list archives

From Jacques Nadeau <jacques.dr...@gmail.com>
Subject Introduction, Plan Working Doc and data format
Date Thu, 08 Nov 2012 03:07:49 GMT
Hello fellow Drillbits,

I wanted to start by introducing myself.  My name is Jacques.  I recently
joined MapR to help develop Apache Drill.

I've been working on adding/updating the doc [1] around the logical and
physical plans.  I've spent some time trying to tease out some of the
hierarchical data concepts and would love any thoughts people have around
my approach to explode/implode operators, nesting and aggregates.  I'm
still working on adding some updated examples into the document.  I also
need to spend a bunch of time blowing out the physical part of the doc.

While I was working on examples it became clear to me that the formalized
plan syntax should be reasonably expressive and use an existing structured
format.  I also know that people generally want a text based format so that
someone can hand generate a plan.  As such, I propose that we move to a
working draft that is based on SSA concepts but utilizes JSON as its
primary structure.  For example:

%1 := scan "text", "local:///logs/*.log", "gzip"

%2 := transform %1, regex(request.cookie, "persistent=([^;]*)"), "userId"

%3 := transform %2, regex(request.cookie, "session=([^;]*)"), "session"

might instead be turned into something like this:

{
  "physical-plan": [
    {
      "scan": {
        "type": "text",
        "file": "local:///logs/*.log",
        "compression": "gzip"
      }
    },
    {
      "transform": {
        "input": 0,
        "transforms": [
          {"expr": "regex(request.cookie, \"persistent=([^;]*)\")", "name": "userId"},
          {"expr": "regex(request.cookie, \"session=([^;]*)\")", "name": "session"}
        ]
      }
    }
  ]
}

You'll note that things like treating multiple transforms as one are much
more feasible.  (This becomes even more true for things like multiple join
conditions.)  Note here that I also removed aggregate and scalar function
expressions from the DAG (I need to work on detailing the syntax of these
in the doc).  For simplicity, I suggest the DAG focus solely on data stream
flow, whether we use the original SSA format, a JSON format, or something
else.  (I considered XML as well, since it offers DTDs, which are more mature
than JSON Schema for validation, but in my experience people find XML
intrinsically more challenging to consume.)
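
As a minimal sketch of the folding described above (an illustration only,
not part of the plan doc): the Python below takes the three SSA statements,
folds the chained transforms into one operator, and emits the JSON-style
plan.  The tuple layout of SSA and the names ssa_to_plan and check_inputs
are hypothetical, introduced just for this example.  It assumes a transform
is only folded when it reads directly from the immediately preceding
transform, and that "input" is a zero-based index into the operator list.

```python
import json

# Hypothetical in-memory form of the three SSA statements from the message:
# (target, op, source, args). This tuple layout is an assumption of the sketch.
SSA = [
    ("%1", "scan", None,
     {"type": "text", "file": "local:///logs/*.log", "compression": "gzip"}),
    ("%2", "transform", "%1",
     {"expr": 'regex(request.cookie, "persistent=([^;]*)")', "name": "userId"}),
    ("%3", "transform", "%2",
     {"expr": 'regex(request.cookie, "session=([^;]*)")', "name": "session"}),
]


def ssa_to_plan(statements):
    """Fold chained transforms into one operator and build the plan document."""
    plan = []      # single-key operator dicts, in data-flow order
    index_of = {}  # SSA name -> index of the operator that produces it
    for target, op, source, args in statements:
        if op == "scan":
            plan.append({"scan": dict(args)})
        elif op == "transform":
            src = index_of[source]
            prev = plan[-1]
            # Fold only when reading directly from the preceding transform.
            if "transform" in prev and src == len(plan) - 1:
                prev["transform"]["transforms"].append(dict(args))
            else:
                plan.append({"transform": {"input": src,
                                           "transforms": [dict(args)]}})
        else:
            raise ValueError(f"unsupported operator in this sketch: {op}")
        index_of[target] = len(plan) - 1
    return {"physical-plan": plan}


def check_inputs(plan_doc):
    """Return True if every 'input' points at an earlier operator."""
    for i, node in enumerate(plan_doc["physical-plan"]):
        (body,) = node.values()  # each operator dict has exactly one key
        ref = body.get("input")
        if ref is not None and not (0 <= ref < i):
            return False
    return True


if __name__ == "__main__":
    doc = ssa_to_plan(SSA)
    assert check_inputs(doc)
    print(json.dumps(doc, indent=2))
```

Requiring each input to refer to an earlier operator is one simple way to
keep the data flow acyclic; a real validator would likely check much more.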

Lastly, I've reverted some of the language around logical and physical plans.
It seems like there was some confusion/disagreement about which words to
use.  Hopefully my current approach makes sense.  Please reference the
diagram in the doc to review the terminology.

I look forward to your thoughts as well as getting to know you all better.

Jacques

[1] https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit
