drill-dev mailing list archives

From Jacques Nadeau <jacques.dr...@gmail.com>
Subject Re: Schemaless Schema Management: Pass per record, Per batch, or ?
Date Thu, 15 Nov 2012 15:47:24 GMT
Ted, I completely agree with regard to batch size versus schema size and
the likely nominal savings of diffs versus complete schemas.  My intention
was more about change detection for downstream efficiency than wire size.
A particular downstream operator may need to reinitialize its operation as
the schema changes.  One way to handle this might be a simple schema number
that is incremented each time the schema changes on the scanner side.  We
can start by sending the schema with each batch.  If we resend the same
schema each time, at least we know it is the same schema, so we don't have
to reinitialize the downstream operator on each batch.  I suppose it isn't
strictly necessary, but it seems like an easy win and straightforward for
the schema generator to execute...
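
A rough sketch of what I'm picturing (all class and field names here are
made up for illustration, nothing from the codebase):

  import java.util.List;

  // Placeholder for whatever schema representation we settle on; assume
  // equals() compares key names and nesting.
  class Schema { }

  // Each batch carries the schema plus a version number that the scanner
  // increments whenever the schema changes.
  class RecordBatch {
    final int schemaVersion;
    final Schema schema;          // resent with every batch, to start
    final List<byte[]> records;

    RecordBatch(int schemaVersion, Schema schema, List<byte[]> records) {
      this.schemaVersion = schemaVersion;
      this.schema = schema;
      this.records = records;
    }
  }

  // A downstream operator pays the reinitialization cost only when the
  // version number actually changes, not on every batch.
  abstract class DownstreamOperator {
    private int lastSeenVersion = -1;

    void consume(RecordBatch batch) {
      if (batch.schemaVersion != lastSeenVersion) {
        reinitialize(batch.schema);   // expensive path, taken rarely
        lastSeenVersion = batch.schemaVersion;
      }
      processRecords(batch.records);  // cheap steady-state path
    }

    abstract void reinitialize(Schema schema);
    abstract void processRecords(List<byte[]> records);
  }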

J

On Thu, Nov 15, 2012 at 1:07 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> Detecting changes can definitely be done, but I question whether the
> savings will be significant if the batches of records are sufficiently
> large to dwarf the derived schema.  If the records are much larger than
> the schema, then recording deltas will not help much.
>
> As such, I would think that staying simple at first is a good thing.
>
> On Thu, Nov 15, 2012 at 12:27 AM, Cheng <googcheng@gmail.com> wrote:
>
> > For schemaless data, I think the tree trunk (i.e., the main structure)
> > exists and is known, so we can detect changes only.
> >
> > Jacques Nadeau <jacques.drill@gmail.com> wrote:
> >
> > >One of the goals we've talked about for Drill is the ability to consume
> > >"schemaless" data.  What this really means to me is data such as JSON,
> > >where the schema could change from record to record (and isn't known
> > >until query execution).  I'd suggest that in most cases, the schema
> > >within a JSON 'source' (a collection of similar files) is mostly stable.
> > >The default JSON format passes this schema data with each record, which
> > >would be the simplest way to manage it.  However, if Drill operated in
> > >this manner, we'd likely have to maintain fairly different code paths
> > >for data with schema versus data without.  There would also be
> > >substantial processing and message-size overhead in handling all the
> > >schema information for each record.  A couple of notes:
> > >
> > >   - By schema here I mean more the structure of the key names and the
> > >   nested structure of the data, as opposed to value data types...
> > >   - A simple example: we have a user table, and one of the query
> > >   expressions is user.phone-numbers.  If we query that without a
> > >   schema, we don't know whether the value is a scalar, a map, or an
> > >   array.  Thus we can't figure out the number of "fields" in the
> > >   output stream (see the illustrative records just below).
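> > >
> > >   For instance, these records in the user table (purely illustrative
> > >   data, not from any real source) are all valid JSON but imply
> > >   different shapes for user.phone-numbers:
> > >
> > >     {"name": "alice", "phone-numbers": "555-0100"}
> > >     {"name": "bob", "phone-numbers": ["555-0100", "555-0199"]}
> > >     {"name": "carol", "phone-numbers": {"home": "555-0100"}}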
> > >
> > >Separately, we've also talked before about having all the main
> > >execution components operate on batches of records as a single work
> > >unit (probably in MsgPack streaming format or similar).
> > >
> > >One way to manage schemaless data within these parameters is to
> > >generate a concrete schema for the data as we're reading it and send it
> > >with each batch of records.  To start, we could resend it with every
> > >batch.  Later, we could add an optimization so that the schema is only
> > >sent when it changes.  A nice additional option would be to store this
> > >schema stream as we're running the first queries so we can treat the
> > >data as fully schemaed on later queries.  (And also provide that schema
> > >back to whatever query parser is being used.)
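> > >
> > >A rough scanner-side sketch of that send-on-change optimization,
> > >reusing the hypothetical RecordBatch and Schema shapes sketched above
> > >(again, nothing here is real code from the project):
> > >
> > >  // Derive a schema for each batch as we read, and bump the version
> > >  // number only when the derived schema differs from the previous one.
> > >  class ScanBatchBuilder {
> > >    private Schema lastSchema;   // null until the first batch
> > >    private int version = 0;
> > >
> > >    RecordBatch build(java.util.List<byte[]> records, Schema derived) {
> > >      if (!derived.equals(lastSchema)) {
> > >        lastSchema = derived;
> > >        version++;               // downstream sees a new schema number
> > >      }
> > >      return new RecordBatch(version, derived, records);
> > >    }
> > >  }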
> > >
> > >Thoughts?  And what about data-type discovery in schemaless data such
> > >as JSON, CSV, etc.?
> > >
> > >Jacques
> >
>
