drill-dev mailing list archives

From Jacques Nadeau <jacques.dr...@gmail.com>
Subject Re: Schemaless Schema Management: Pass per record, Per batch, or ?
Date Fri, 16 Nov 2012 00:47:00 GMT
Timothy,

>IDL
Great!

>Filtering
I've created a document--just copying and pasting my email thoughts.  You
and I have edit access; everyone else has comment access.  (If anyone else
wants edit access, let us know.)  We can start to iterate there.

https://docs.google.com/document/d/1hQMjmqmjw7ptBx0TbBRlGtEeXWhE2a-CinZx5ffm9Bg/edit

Thanks,
Jacques


On Thu, Nov 15, 2012 at 12:02 PM, Timothy Chen <tnachen@gmail.com> wrote:

> Hi Jacques,
>
> I'll add a simple IDL that we can iterate on.
>
> About the filtering discussion, do you want to bring it to a Google
> doc?
>
> Tim
>
> Sent from my iPhone
>
> On Nov 14, 2012, at 9:42 PM, Jacques Nadeau <jacques.drill@gmail.com>
> wrote:
>
> > Hey Timothy,
> >
> > It's great that you started pulling something together.  Thanks for
> > taking the initiative!  Do you want to spend some time trying to
> > define an IDL for MsgPack schema information and add that to your
> > work?
> >
> > We also need to come up with a standard selection/filter
> > vocabulary/approach.  It would preferably cover things like the
> > following (toy sketch after the list):
> >
> >   - Support simple field/tree inclusion lists and wildcards.
> >      - Classic relational like {column1, column2, column3}
> >      - Nested like {arrayColumn1.[*], mapColumn.foo}
> >   - Support some kind of filters that could prune records, leaves, or
> >     branches:
> >      - only include the first three sub-elements
> >      - only include map keys that start with "user%"
> >      - only include this record where at least one
> >        arrayColumn.phone-number starts with "415%"
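> >
> > To make that concrete, here's a toy sketch in plain Java (all names
> > invented; a bare "*" stands in for the bracketed wildcard above, so
> > don't read it as a syntax proposal), matching selectors against
> > flattened key paths:
> >
> >   import java.util.List;
> >   import java.util.regex.Pattern;
> >
> >   public class SelectorSketch {
> >     // "*" matches exactly one dotted path segment.
> >     static boolean matches(String selector, String path) {
> >       String regex = Pattern.quote(selector).replace("*", "\\E[^.]+\\Q");
> >       return path.matches(regex);
> >     }
> >
> >     public static void main(String[] args) {
> >       List<String> selectors = List.of("mapColumn.foo", "arrayColumn1.*");
> >       for (String p : List.of("mapColumn.foo", "mapColumn.bar",
> >                               "arrayColumn1.0", "arrayColumn1.0.phone")) {
> >         boolean keep = selectors.stream().anyMatch(s -> matches(s, p));
> >         System.out.println(p + " -> " + (keep ? "include" : "exclude"));
> >       }
> >     }
> >   }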
> >
> > One idea might be to conceive of a fourth concept on top of the
> > classic (table|scalar|aggregate) functions called tree functions and
> > generate a set of primitives for that.  Then allow scalar functions
> > inside tree function evaluation.  (I haven't thought a great deal
> > about what this means.)
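> >
> > Loosely, and only as a hedged sketch (the types and names below are
> > placeholders, not a proposal), a tree function might be a function
> > from a subtree to a pruned subtree, with scalar functions evaluated
> > at the leaves:
> >
> >   import java.util.Map;
> >   import java.util.function.Predicate;
> >   import java.util.stream.Collectors;
> >
> >   public class TreeFunctionSketch {
> >     // Tree function: subtree in, pruned subtree out; the scalar
> >     // function (here a predicate) runs inside the evaluation.
> >     static Map<String, Object> keepLeaves(Map<String, Object> tree,
> >                                           Predicate<Object> scalar) {
> >       return tree.entrySet().stream()
> >           .filter(e -> scalar.test(e.getValue()))
> >           .collect(Collectors.toMap(Map.Entry::getKey,
> >                                     Map.Entry::getValue));
> >     }
> >
> >     public static void main(String[] args) {
> >       Map<String, Object> user = Map.of("name", "alice",
> >                                         "phone", "415-555-0100");
> >       // e.g. keep only leaves whose value starts with "415"
> >       System.out.println(keepLeaves(user, v ->
> >           v instanceof String && ((String) v).startsWith("415")));
> >     }
> >   }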
> > I've also thought that XPath might be a good place to look for
> > conceptual inspiration.  (But I don't think we have any interest in
> > going to that level...)
> >
> > Does any of this sound interesting?  (That also goes for anyone out
> > there who is lurking...)
> >
> > Thanks again,
> > Jacques
> >
> >
> > On Wed, Nov 14, 2012 at 5:45 PM, Timothy Chen <tnachen@gmail.com> wrote:
> >
> >> I don't have much to add to the options you've suggested, but I do
> >> agree that storing the schema and sending the diffs would be the
> >> ideal way to go.
> >>
> >> And since we already need to look at every row, we can build the schema
> >> diffs pretty easily.
> >>
> >> I currently have a simple JSON -> MsgPack impl using Yajl here:
> >>
> >> https://github.com/tnachen/incubator-drill/tree/executor/sandbox/executor
> >>
> >> Depending on the parser we use, most already have basic type
> >> detection, and we can add discovery of more data types later on as
> >> extensions.
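> >>
> >> As a purely illustrative sketch of the kind of type detection a JSON
> >> parser already gives you (this one assumes Jackson on the classpath;
> >> the sandbox impl above uses Yajl, not this):
> >>
> >>   import com.fasterxml.jackson.databind.JsonNode;
> >>   import com.fasterxml.jackson.databind.ObjectMapper;
> >>
> >>   public class TypeSniff {
> >>     public static void main(String[] args) throws Exception {
> >>       JsonNode root = new ObjectMapper().readTree(
> >>           "{\"name\":\"alice\",\"age\":30,\"phones\":[\"415-0100\"]}");
> >>       // Each top-level field arrives with a detected node type:
> >>       // name: STRING, age: NUMBER, phones: ARRAY
> >>       root.fields().forEachRemaining(e -> System.out.println(
> >>           e.getKey() + ": " + e.getValue().getNodeType()));
> >>     }
> >>   }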
> >>
> >> Tim
> >>
> >>
> >>
> >> On Wed, Nov 14, 2012 at 3:17 PM, Jacques Nadeau
> >> <jacques.drill@gmail.com> wrote:
> >>
> >>> One of the goals we've talked about for Drill is the ability to
> >>> consume "schemaless" data.  What this really means to me is data
> >>> such as JSON, where the schema could change from record to record
> >>> (and isn't known until query execution).  I'd suggest that in most
> >>> cases, the schema within a JSON 'source' (a collection of similar
> >>> files) is mostly stable.  The default JSON format passes this schema
> >>> data with each record, and that would be the simplest way to manage
> >>> this data.  However, if Drill operated in this manner, we'd likely
> >>> have to maintain fairly different code paths for data with schema
> >>> versus data without.  There would also likely be substantial
> >>> processing and message-size overhead in handling all the schema
> >>> information for each record.  A couple of notes:
> >>>
> >>>   - By schema here I mean more the structure of the key names and
> >>>     the nesting of the data, as opposed to value data types...
> >>>   - A simple example: we have a user table, and one of the query
> >>>     expressions is user.phone-numbers.  If we query that without a
> >>>     schema, we don't know if that is a scalar, a map, or an array.
> >>>     Thus we can't figure out the number of "fields" in the output
> >>>     stream.
> >>>
> >>>
> >>> Separately, we've also talked before about having all the main
> >>> execution components operate on batches of records as a single work
> >>> unit (probably in MsgPack streaming format or similar).
> >>>
> >>> One way to manage schemaless data within these parameters is to
> >>> generate a concrete schema of the data as we're reading it and send
> >>> it with each batch of records.  To start, we could resend it with
> >>> every batch.  Later, we could add an optimization so the schema is
> >>> only sent when it changes.  A nice additional option would be to
> >>> store this schema stream as we're running the first queries, so we
> >>> can treat this data as fully schemaed on later queries.  (And also
> >>> provide that schema back to whatever query parser is being used.)
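> >>>
> >>> As a rough sketch of that batch envelope (field names invented, not
> >>> a wire-format proposal), in Java:
> >>>
> >>>   import java.util.List;
> >>>
> >>>   public class BatchSketch {
> >>>     // One entry in the concrete schema discovered while reading.
> >>>     record FieldDef(String path, String type) {}
> >>>
> >>>     // A batch carries its records plus the schema; a null schema
> >>>     // could later mean "unchanged since the previous batch".
> >>>     record RecordBatch(List<FieldDef> schema, byte[] encodedRecords) {}
> >>>
> >>>     public static void main(String[] args) {
> >>>       RecordBatch first = new RecordBatch(
> >>>           List.of(new FieldDef("user.name", "string"),
> >>>                   new FieldDef("user.phone-numbers", "array<string>")),
> >>>           new byte[0]);  // records would be MsgPack-encoded bytes
> >>>       RecordBatch next = new RecordBatch(null, new byte[0]);
> >>>       System.out.println("schema resent? " + (next.schema() != null));
> >>>     }
> >>>   }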
> >>>
> >>> Thoughts?  What about data type discovery in schemaless data such
> >>> as JSON, CSV, etc.?
> >>>
> >>> Jacques
> >>
>
