drill-dev mailing list archives

From Jacques Nadeau <jacques.dr...@gmail.com>
Subject Schemaless Schema Management: Pass per record, Per batch, or ?
Date Wed, 14 Nov 2012 23:17:01 GMT
One of the goals we've talked about for Drill is the ability to consume
"schemaless" data.  What this really means to me is data such as JSON where
the schema of data could change from record to record (and isn't known
until query execution).  I'd suggest that in most cases, the schema within
a JSON 'source' (collection of similar files) is mostly stable.  The
default JSON format passes this schema data with each record.  This would
be the simplest way to manage this data.  However, if Drill operated in
this manner we'd likely have to manage fairly different code paths for data
with schema versus those without.  There would also likely be a
substantial processing and message size overhead from handling all the
schema information for each record.  A couple of notes:

   - By schema here I mean the key names and nested structure of the
   data, rather than the data types of the values.
   - A simple example: we have a user table and one of the query
   expressions is user.phone-numbers.  If we query that without a schema, we
   don't know whether it is a scalar, a map, or an array.  Thus... we can't figure
   out the number of "fields" in the output stream.
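
To make the user.phone-numbers ambiguity concrete, here is a minimal Python
sketch. The shape() helper and the sample records are hypothetical
illustrations, not part of any Drill design; the helper reports only key names
and nesting, ignoring value data types, as described in the first note.

```python
def shape(value):
    """Return the structural shape of a JSON value, ignoring value types."""
    if isinstance(value, dict):
        # A map: recurse so nested key names are part of the shape.
        return {key: shape(child) for key, child in value.items()}
    if isinstance(value, list):
        return "array"
    return "scalar"


# Three hypothetical user records from the same JSON source.
records = [
    {"name": "a", "phone-numbers": "555-0100"},
    {"name": "b", "phone-numbers": ["555-0100", "555-0101"]},
    {"name": "c", "phone-numbers": {"home": "555-0100"}},
]

# The same query expression resolves to a different structure per record.
shapes = [shape(record)["phone-numbers"] for record in records]
```

Without knowing which of these shapes applies, a consumer cannot say up front
how many output fields the expression produces.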

Separately, we've also talked before about having all the main execution
components operate on batches of records as a single work unit
(probably in MsgPack streaming format or similar).

One way to manage schemaless data within these parameters is to generate a
concrete schema of the data as we read it and send it with each batch
of records.  To start, we could resend it with every batch.  Later, we
could add an optimization so that the schema is only sent when it changes.  A
nice additional option would be to store this schema stream as we're
running the first queries so we can treat this data as fully schemaed on
later queries.  (And also provide that schema back to whatever query parser
is being used.)
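
A rough Python sketch of the per-batch idea, including the "only send when it
changes" optimization. Everything here is an assumption for illustration: the
paths() and batches() names, representing a schema as a set of (path, kind)
pairs, and signalling "unchanged" with None. It is not a proposed wire format.

```python
from itertools import islice


def paths(value, prefix=""):
    """Yield (path, kind) pairs covering key names and nesting only."""
    if isinstance(value, dict):
        kind = "map"
    elif isinstance(value, list):
        kind = "array"
    else:
        kind = "scalar"
    if prefix:
        yield (prefix, kind)
    if isinstance(value, dict):
        for key, child in value.items():
            yield from paths(child, f"{prefix}.{key}" if prefix else key)


def batches(records, batch_size):
    """Yield (schema_or_None, batch); schema is None when unchanged."""
    it = iter(records)
    previous = None
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        # Concrete schema for this batch: union over all of its records.
        schema = frozenset(p for record in batch for p in paths(record))
        yield (schema if schema != previous else None, batch)
        previous = schema


# Hypothetical source whose structure changes at the fifth record.
source = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4},
          {"id": 5, "tags": ["x"]}, {"id": 6, "tags": []}]
sent = [schema for schema, _ in batches(source, 2)]
```

In this sketch the second batch carries no schema because nothing changed,
while the third carries a new one that includes the tags array. Persisting the
sequence of schemas that were sent would give the stored "schema stream"
mentioned above.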

Thoughts?  What about thoughts on data type discovery in schemaless data
such as JSON, CSV, etc.?

