drill-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Schemaless Schema Management: Pass per record, Per batch, or ?
Date Thu, 15 Nov 2012 01:36:33 GMT
Incrementally generating an "as-observed" schema is an interesting idea.  It
should come nearly for free since the records need to be parsed in any case.

I would guess that this won't work so well with data that has a highly
variable record structure, but it hardly seems likely to be any worse than
the truly schema-free design that we had been talking about up to now.
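
A minimal sketch of what "as-observed" could mean in practice, assuming the
schema is just a map from a dotted key path to the set of structural kinds
(scalar, map, array) seen so far. The names kind_of and observe, the path
notation, and the sample records are hypothetical and not part of any Drill
design. The point is only that the schema is folded in during the same pass
that parses the records.

```python
import json
from collections import defaultdict


def kind_of(value):
    """Classify a parsed JSON value by structure only, ignoring its data type."""
    if isinstance(value, dict):
        return "map"
    if isinstance(value, list):
        return "array"
    return "scalar"


def observe(schema, value, path=""):
    """Fold one parsed value into the running schema (path -> set of kinds)."""
    kind = kind_of(value)
    if path:
        schema[path].add(kind)
    if kind == "map":
        for key, child in value.items():
            observe(schema, child, f"{path}.{key}" if path else key)
    elif kind == "array":
        # Hypothetical notation: "[]" marks the elements of an array.
        for child in value:
            observe(schema, child, path + "[]")
    return schema


# Hypothetical records whose phone-numbers field changes shape.
lines = [
    '{"name": "a", "phone-numbers": "555-0100"}',
    '{"name": "b", "phone-numbers": ["555-0101", "555-0102"]}',
    '{"name": "c", "phone-numbers": {"home": "555-0103"}}',
]

schema = defaultdict(set)
for line in lines:
    # The record has to be parsed anyway; observing it is one extra walk.
    observe(schema, json.loads(line))
```

A path that ends up with more than one kind, as phone-numbers does here, is
where highly variable data would start to hurt.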

On Wed, Nov 14, 2012 at 3:17 PM, Jacques Nadeau <jacques.drill@gmail.com> wrote:

> One of the goals we've talked about for Drill is the ability to consume
> "schemaless" data.  What this really means to me is data such as JSON where
> the schema of data could change from record to record (and isn't known
> until query execution).  I'd suggest that in most cases, the schema within
> a JSON 'source' (collection of similar files) is mostly stable.  The
> default JSON format passes this schema data with each record.  This would
> be the simplest way to manage this data.  However, if Drill operated in
> this manner we'd likely have to manage fairly different code paths for data
> with schema versus those without.  It also seems like there would be a
> substantial processing and message size overhead interacting with all the
> schema information for each record.  Couple of notes:
>    - By schema here I mean the structure of the key names and the nesting
>    of the data more than the value data types...
>    - A simple example: we have a user table and one of the query
>    expressions is user.phone-numbers.  If we query that without a schema,
>    we don't know whether it is a scalar, a map or an array.  Thus... we
>    can't figure out the number of "fields" in the output stream.
> Separately, we've also talked before about having all the main execution
> components operate on batches of records as a single work unit
> (probably in MsgPack streaming format or similar).
> One way to manage schemaless data within these parameters is to generate a
> concrete schema of the data as we're reading it and send it with each batch
> of records.  To start, we could resend it with every batch.  Later, we
> could add an optimization that the schema is only sent when it changes.  A
> nice additional option would be to store this schema stream as we're
> running the first queries so we can treat this data as fully schemaed on
> later queries.  (And also provide that schema back to whatever query parser
> is being used.)
> Thoughts?  What about thoughts on data type discovery in schemaless data
> such as JSON, CSV, etc?
> Jacques
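
The per-batch proposal in the quoted message, including the "only send the
schema when it changes" optimization, might look roughly like the sketch
below. Everything here is an assumption made for illustration: the schema is
a frozenset of (path, kind) pairs, batches are plain Python lists rather than
MsgPack, and the names paths and batches_with_schema are hypothetical.

```python
def paths(value, prefix=""):
    """Yield (path, kind) pairs for key names and nesting, not value types."""
    if isinstance(value, dict):
        kind = "map"
    elif isinstance(value, list):
        kind = "array"
    else:
        kind = "scalar"
    if prefix:
        yield prefix, kind
    if kind == "map":
        for key, child in value.items():
            yield from paths(child, f"{prefix}.{key}" if prefix else key)
    elif kind == "array":
        for child in value:
            yield from paths(child, prefix + "[]")


def batches_with_schema(records, batch_size):
    """Yield (schema_or_None, batch); None means 'same schema as last sent'."""
    last_sent = None
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        schema = frozenset(p for record in batch for p in paths(record))
        if schema != last_sent:
            yield schema, batch
            last_sent = schema
        else:
            yield None, batch


# Hypothetical user records; the fifth one changes the shape of phone-numbers.
records = [
    {"name": "a", "phone-numbers": ["1"]},
    {"name": "b", "phone-numbers": ["2"]},
    {"name": "c", "phone-numbers": ["3"]},
    {"name": "d", "phone-numbers": ["4"]},
    {"name": "e", "phone-numbers": {"home": "5"}},
    {"name": "f", "phone-numbers": ["6"]},
]

out = list(batches_with_schema(records, 2))
sent = [schema is not None for schema, _ in out]
```

With a batch size of 2, the schema travels with the first batch, is omitted
for the second, and is resent with the third because phone-numbers shows up
as a map there. Persisting the schemas that were sent would give the "schema
stream" that later queries could reuse.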
