drill-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Introduction, Plan Working Doc and data format
Date Thu, 08 Nov 2012 07:47:28 GMT
Welcome Jacques!

On Wed, Nov 7, 2012 at 7:07 PM, Jacques Nadeau <jacques.drill@gmail.com> wrote:

> Hello fellow Drillbits,
>
> I wanted to start by introducing myself.  My name is Jacques.  I recently
> joined MapR to help develop Apache Drill.
>
> I've been working on adding/updating the doc [1] around the logical and
> physical plans.  I've spent some time trying to tease out some of the
> hierarchical data concepts and would love any thoughts people have around
> my approach to explode/implode operators, nesting and aggregates.  I'm
> still working on adding some updated examples into the document.  I also
> need to spend a bunch of time blowing out the physical part of the doc.
>
> While I was working on examples it became clear to me that the formalized
> plan syntax should be reasonably expressive and use an existing structured
> format.  I also know that people generally want a text based format so that
> someone can hand generate a plan.  As such, I propose that we move to a
> working draft that is based on SSA concepts but utilizes JSON as its
> primary structure.  For example:
>
> %1 := scan "text", "local:///logs/*.log", "gzip"
>
> %2 := transform %1, regex(request.cookie, "persistent=([^;]*)"), "userId"
>
> %3 := transform %2, regex(request.cookie, "session=([^;]*)"), "session"
>
> might instead be turned into something like this:
>
> physical-plan: [
>   {
>     scan: {
>         type: "text",
>         file: "local:///logs/*.log",
>         compression: "gzip"
>     }
>   },
>   {
>     transform: {
>         input: 0,
>         transforms: [
>             {expr: "regex(request.cookie, \"persistent=([^;]*)\")", name: "userId"},
>             {expr: "regex(request.cookie, \"session=([^;]*)\")", name: "session"}
>         ]
>     }
>   }
>
> ]
>
> You'll note that things like treating multiple transforms as one are much
> more feasible.  (This becomes even more true for things like multiple join
> conditions.)  Note here that I also removed aggregate and scalar function
> expressions from the DAG (I need to work on detailing the syntax of these
> in the doc).  For simplicity, I suggest the DAG focus solely on data stream
> flow, whether we use the original SSA format, a JSON format, or something
> else.  (I considered XML as well, since it gives DTDs, which are more mature
> than JSON Schema for validation, but my experience is that people find
> XML intrinsically more challenging to consume.)
>
> Lastly, I've reverted some of the language around logical and physical plans.
> It seems like there was some confusion/disagreement about which words to
> use.  Hopefully my current approach makes sense.  Please reference the
> diagram in the doc to review the terminology.
>
> I look forward to your thoughts as well as getting to know you all better.
>
> Jacques
>
> [1]
>
> https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit
>
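
The JSON layout proposed in the quoted message can be sketched in Python as
below. This is only an illustration under stated assumptions, not Drill's plan
format: it assumes that "input" is a zero-based index into the plan list and
that each entry holds exactly one operator. The helper name check_inputs is
hypothetical.

```python
import json

# Plan from the quoted message, written as plain Python data.
# Assumption: "input" is the zero-based index of an earlier list entry.
physical_plan = [
    {"scan": {"type": "text",
              "file": "local:///logs/*.log",
              "compression": "gzip"}},
    {"transform": {
        "input": 0,
        "transforms": [
            {"expr": 'regex(request.cookie, "persistent=([^;]*)")',
             "name": "userId"},
            {"expr": 'regex(request.cookie, "session=([^;]*)")',
             "name": "session"},
        ],
    }},
]


def check_inputs(plan):
    """Return error strings for inputs that do not reference an earlier entry.

    Hypothetical helper: requiring inputs to point backwards keeps the plan
    acyclic, similar to how an SSA name must be defined before it is used.
    """
    errors = []
    for index, node in enumerate(plan):
        if len(node) != 1:
            errors.append(f"operator {index}: expected exactly one key")
            continue
        (op_name, body), = node.items()
        ref = body.get("input")
        if ref is None:
            continue  # source operators such as scan have no input
        if not isinstance(ref, int) or not 0 <= ref < index:
            errors.append(f"operator {index} ({op_name}): bad input {ref!r}")
    return errors


# json.dumps handles the quoting and escaping of the regex strings.
document = json.dumps({"physical-plan": physical_plan}, indent=2)
```

Building the plan as data and serializing it, rather than writing the JSON by
hand, sidesteps the nested-quote escaping that the regex expressions need.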
