drill-dev mailing list archives

From: Timothy Chen <tnac...@gmail.com>
Subject: Re: Schemaless Schema Management: Pass per record, Per batch, or ?
Date: Thu, 15 Nov 2012 20:02:41 GMT
Hi Jacques,

I'll add a simple IDL that we can iterate on.

As for the filtering discussion, do you want to move it to a Google doc?

Tim

Sent from my iPhone

On Nov 14, 2012, at 9:42 PM, Jacques Nadeau <jacques.drill@gmail.com> wrote:

> Hey Timothy,
> 
> It's great that you started pulling something together.  Thanks for taking
> the initiative!  Do you want to spend some time trying to define an IDL for
> MsgPack schema information and add that to your work?
> 
> We also need to come up with a standard selection/filter
> vocabulary/approach.  It would preferably cover things like (see the sketch
> after this list):
> 
>   - Support simple field/tree inclusion lists and wildcards.
>      - Classic relational like {column1, column2, column3}
>      - Nested like {arrayColumn1.[*], mapColumn.foo}
>   - Support some kind of filter that could prune records, leaves, or
>   branches
>      - only include the first three sub elements
>      - only include map keys that start with "user%"
>      - only include a record where at least one
>      arrayColumn.phone-number starts with "415%"
> 
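To make the inclusion-list idea above concrete, here's a rough sketch of
matching a field path against a selection list (purely illustrative Java; the
names are hypothetical, and it assumes a dot-delimited path syntax with [*]
as an element wildcard and exact-length matching):

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical sketch: match a concrete field path against selection
    // patterns such as "column1", "arrayColumn1.[*]" or "mapColumn.foo".
    public class SelectionSketch {

      /** True if the given path is covered by any selection pattern. */
      static boolean isSelected(String path, List<String> patterns) {
        for (String pattern : patterns) {
          if (matches(path.split("\\."), pattern.split("\\."))) {
            return true;
          }
        }
        return false;
      }

      private static boolean matches(String[] path, String[] pattern) {
        if (path.length != pattern.length) {
          return false;
        }
        for (int i = 0; i < path.length; i++) {
          // "[*]" selects every element at this level (e.g. all array indices).
          if (!pattern[i].equals("[*]") && !pattern[i].equals(path[i])) {
            return false;
          }
        }
        return true;
      }

      public static void main(String[] args) {
        List<String> selection =
            Arrays.asList("column1", "arrayColumn1.[*]", "mapColumn.foo");
        System.out.println(isSelected("arrayColumn1.[2]", selection));  // true
        System.out.println(isSelected("mapColumn.bar", selection));     // false
      }
    }

Predicate-style filters (the "user%" and "415%" examples) would presumably
layer on top of this as expressions evaluated per record, leaf, or branch.
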
> One idea might be to introduce a fourth concept, on top of the classic
> (table|scalar|aggregate) functions, called tree functions, and generate a
> set of primitives for it.  Then allow scalar functions inside tree-function
> evaluation.  (I haven't thought a great deal about what this means.)
> I've also thought that xpath might be a good place to look for conceptual
> inspiration.  (But I don't think we have any interest in going to that
> level...)
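
For what it's worth, a tree-function primitive might look roughly like this
(a hypothetical Java sketch, not a concrete API proposal): it maps a subtree
of a record to a pruned or restructured subtree, and scalar functions could
be applied to leaf values during its evaluation.

    // Hypothetical sketch of a "tree function" primitive: unlike a scalar
    // function (value -> value) or an aggregate (values -> value), it maps a
    // subtree of a record to a (possibly pruned or restructured) subtree.
    interface TreeFunction<T> {
      // 'subtree' is some generic tree-node representation; scalar functions
      // could be applied to its leaf values during evaluation.
      T apply(T subtree);
    }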
> 
> Does any of this sound interesting?   (That also goes for anyone out there
> who is lurking...)
> 
> Thanks again,
> Jacques
> 
> 
> On Wed, Nov 14, 2012 at 5:45 PM, Timothy Chen <tnachen@gmail.com> wrote:
> 
>> I don't have much to add to the options you've suggested, but I do agree
>> that storing the schema and sending the diffs would be the ideal way to go.
>> 
>> And since we already need to look at every row, we can build the schema
>> diffs pretty easily.
>> 
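Here's a rough sketch of what per-record schema inference plus diffing could
look like while scanning (illustrative only, not the sandbox code; it reduces
"schema" to a flat map from field path to a coarse kind, and uses Jackson
just for the example):

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative sketch: infer a flat "path -> kind" schema for each record
    // while scanning, and emit only the difference from the previous record.
    public class SchemaDiffSketch {

      static Map<String, String> inferSchema(JsonNode node, String prefix,
                                             Map<String, String> out) {
        if (node.isObject()) {
          node.fields().forEachRemaining(e -> inferSchema(e.getValue(),
              prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey(), out));
        } else if (node.isArray()) {
          out.put(prefix, "array");
          node.forEach(child -> inferSchema(child, prefix + ".[*]", out));
        } else {
          // number, string, boolean, null, ...
          out.put(prefix, node.getNodeType().toString().toLowerCase());
        }
        return out;
      }

      /** Entries in 'current' that are new or changed relative to 'previous'. */
      static Map<String, String> diff(Map<String, String> previous,
                                      Map<String, String> current) {
        Map<String, String> changed = new LinkedHashMap<>();
        current.forEach((path, kind) -> {
          if (!kind.equals(previous.get(path))) {
            changed.put(path, kind);
          }
        });
        return changed;
      }

      public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        Map<String, String> prev = inferSchema(
            mapper.readTree("{\"user\":{\"name\":\"a\"}}"), "", new LinkedHashMap<>());
        Map<String, String> curr = inferSchema(
            mapper.readTree("{\"user\":{\"name\":\"b\",\"age\":3}}"), "", new LinkedHashMap<>());
        System.out.println(diff(prev, curr));  // {user.age=number}
      }
    }

(A real version would also need to represent removed fields and merge the
per-record schemas up into a per-batch schema.)
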
>> I currently have a simple JSON -> MsgPack impl using Yajl here:
>> https://github.com/tnachen/incubator-drill/tree/executor/sandbox/executor
>> 
>> Depending on the parser we use, most already have basic type detection, and
>> we can extend data type discovery later on via extensions.
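
On extending type discovery: one possible extension point (names hypothetical,
just a sketch) is a chain of pluggable detectors that refine the basic type
the parser reports, e.g. recognizing that a JSON string is really a date:

    import java.time.LocalDate;
    import java.time.format.DateTimeParseException;
    import java.util.List;

    // Hypothetical extension point: refine the basic type reported by the
    // parser (string/number/bool) into richer types such as dates, by trying
    // a list of pluggable detectors in order.
    interface TypeDetector {
      /** A refined type name for the raw value, or null if it doesn't apply. */
      String detect(String rawValue);
    }

    class DateDetector implements TypeDetector {
      public String detect(String rawValue) {
        try {
          LocalDate.parse(rawValue);   // ISO-8601 dates like "2012-11-14"
          return "date";
        } catch (DateTimeParseException e) {
          return null;
        }
      }
    }

    class TypeDiscovery {
      static String discover(String rawValue, List<TypeDetector> detectors) {
        for (TypeDetector d : detectors) {
          String refined = d.detect(rawValue);
          if (refined != null) {
            return refined;
          }
        }
        return "string";               // fall back to the parser's basic type
      }
    }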
>> 
>> Tim
>> 
>> 
>> 
>> On Wed, Nov 14, 2012 at 3:17 PM, Jacques Nadeau <jacques.drill@gmail.com>
>> wrote:
>> 
>>> One of the goals we've talked about for Drill is the ability to consume
>>> "schemaless" data.  What this really means to me is data such as JSON where
>>> the schema of the data could change from record to record (and isn't known
>>> until query execution).  I'd suggest that in most cases, the schema within
>>> a JSON 'source' (collection of similar files) is mostly stable.  The
>>> default JSON format passes this schema data with each record.  That would
>>> be the simplest way to manage this data.  However, if Drill operated in
>>> this manner we'd likely have to maintain fairly different code paths for
>>> data with schema versus data without.  There would also likely be
>>> substantial processing and message-size overhead from interacting with all
>>> the schema information for each record.  A couple of notes:
>>> 
>>>   - By schema here I mean more the structure of the key names and nested
>>>   structure of the data, as opposed to value data types...
>>>   - A simple example: we have a user table and one of the query
>>>   expressions is user.phone-numbers.  If we query that without a schema,
>>>   we don't know if that is a scalar, a map, or an array (see the two
>>>   sample records below).  Thus... we can't figure out the number of
>>>   "fields" in the output stream.
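
Two concrete records to make that ambiguity visible (an illustrative snippet;
Jackson is used only for the example, not as a suggestion for Drill's parser):

    import com.fasterxml.jackson.databind.ObjectMapper;

    // The same path, user.phone-numbers, has a different shape in each record.
    // Without a schema, an expression over that path can't know up front
    // whether it will see a scalar, an array, or a map.
    public class ShapeAmbiguity {
      public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        String r1 = "{\"user\":{\"phone-numbers\":\"415-555-0100\"}}";
        String r2 = "{\"user\":{\"phone-numbers\":[\"415-555-0100\",\"415-555-0101\"]}}";
        System.out.println(mapper.readTree(r1).at("/user/phone-numbers").getNodeType()); // STRING
        System.out.println(mapper.readTree(r2).at("/user/phone-numbers").getNodeType()); // ARRAY
      }
    }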
>>> 
>>> 
>>> Separately, we've also talked before about having all the main execution
>>> components operate on batches of records, with a batch as the single unit
>>> of work (probably in MsgPack streaming format or similar).
>>> 
>>> One way to manage schemaless data within these parameters is to generate
>>> a concrete schema of the data as we're reading it and send it with each
>>> batch of records.  To start, we could resend it with every batch.  Later,
>>> we could add an optimization so that the schema is only sent when it
>>> changes.  A nice additional option would be to store this schema stream as
>>> we're running the first queries so we can treat this data as fully
>>> schemaed on later queries.  (And also provide that schema back to whatever
>>> query parser is being used.)
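
A minimal sketch of the batch framing this suggests (hypothetical names; the
schema representation is whatever the IDL ends up defining): each batch
carries its records plus a schema that is populated only when it differs from
the previous batch.

    import java.util.List;
    import java.util.Map;

    // Hypothetical framing: a unit of work is a batch of records.  The schema
    // is attached to the first batch and thereafter only when it changes;
    // receivers keep the last schema they saw and apply it to batches that
    // arrive without one.
    class RecordBatch {
      final List<byte[]> records;              // e.g. MsgPack-encoded records
      final Map<String, String> schemaOrNull;  // null => same schema as before

      RecordBatch(List<byte[]> records, Map<String, String> schemaOrNull) {
        this.records = records;
        this.schemaOrNull = schemaOrNull;
      }
    }

    class BatchReceiver {
      private Map<String, String> currentSchema;

      void receive(RecordBatch batch) {
        if (batch.schemaOrNull != null) {
          currentSchema = batch.schemaOrNull;  // schema changed (or first batch)
        }
        // ... decode batch.records against currentSchema ...
      }
    }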
>>> 
>>> Thoughts?  And any thoughts on data type discovery in schemaless data
>>> such as JSON, CSV, etc.?
>>> 
>>> Jacques
>> 
