drill-dev mailing list archives

From Cheng <googch...@gmail.com>
Subject Re: Schemaless Schema Management: Pass per record, Per batch, or ?
Date Thu, 15 Nov 2012 08:27:04 GMT
For schemaless data, I think the tree trunk exists and is known, so we can detect changes.

Jacques Nadeau <jacques.drill@gmail.com> wrote:

>One of the goals we've talked about for Drill is the ability to consume
>"schemaless" data.  What this really means to me is data such as JSON where
>the schema of data could change from record to record (and isn't known
>until query execution).  I'd suggest that in most cases, the schema within
>a JSON 'source' (collection of similar files) is mostly stable.  The
>default JSON format passes this schema data with each record.  This would
>be the simplest way to manage this data.  However, if Drill operated in
>this manner we'd likely have to manage fairly different code paths for data
>with schema versus those without.  It also seems like there would be a
>substantial processing and message size overhead from handling all the
>schema information for each record.  A couple of notes:
>   - By schema here I mean the key names and nested structure of the
>   data, as opposed to value data types.
>   - A simple example: we have a user table and one of the query
>   expressions is user.phone-numbers.  If we query that without schema, we
>   don't know if that is a scalar, a map or an array.  Thus... we can't figure
>   out the number of "fields" in the output stream.
>Separately, we've also talked before about having all the main execution
>components operate on batches of records as a single work unit
>(probably in MsgPack streaming format or similar).
>One way to manage schemaless data within these parameters is to generate a
>concrete schema of the data as we read it and send it with each batch
>of records.  To start, we could resend it with every batch.  Later, we
>could add an optimization that the schema is only sent when it changes.  A
>nice additional option would be to store this schema stream as we're
>running the first queries so we can treat this data as fully schemaed on
>later queries.  (And also provide that schema back to whatever query parser
>is being used.)
>Thoughts?  What about data type discovery in schemaless data
>such as JSON, CSV, etc?
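
The per-batch idea in the quoted message can be sketched roughly as follows. This is only an illustration under stated assumptions, not Drill code: the function names (structure_of, batch_schema, emit_batches), the "$" root marker, and the "[]" suffix for array elements are all hypothetical. It derives a structural schema (key paths and whether each is a scalar, map, or array, ignoring value types) for each batch of JSON records and attaches it to the batch only when it differs from the previous batch's schema.

```python
def structure_of(value, path, out):
    """Record the structural kind (scalar, map, array) seen at each key path."""
    if isinstance(value, dict):
        kind = "map"
    elif isinstance(value, list):
        kind = "array"
    else:
        kind = "scalar"
    out.setdefault(path, set()).add(kind)
    if isinstance(value, dict):
        for key, child in value.items():
            structure_of(child, path + "." + key, out)
    elif isinstance(value, list):
        for child in value:
            # Hypothetical convention: "[]" marks the elements of an array.
            structure_of(child, path + "[]", out)
    return out


def batch_schema(records, root="$"):
    """Merge the structure of every record in a batch into one comparable value."""
    out = {}
    for record in records:
        structure_of(record, root, out)
    # Freeze into sorted tuples so two schemas can be compared with ==.
    return tuple(sorted((p, tuple(sorted(kinds))) for p, kinds in out.items()))


def emit_batches(batches):
    """Yield (schema, records); schema is None when unchanged since the last batch."""
    previous = None
    for records in batches:
        schema = batch_schema(records)
        if schema != previous:
            yield schema, records
            previous = schema
        else:
            yield None, records
```

With this sketch, a batch where phone-numbers is a plain string followed by one where it is an array would cause the schema to be resent on the second batch, while consecutive batches with the same structure carry no schema. The emitted schema stream could also be stored for later queries, as suggested in the quoted message.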