drill-dev mailing list archives

From "Jacques Nadeau (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-19) Build a JSON scanner that does schema discovery
Date Tue, 15 Jan 2013 18:58:17 GMT

    [ https://issues.apache.org/jira/browse/DRILL-19?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554154#comment-13554154 ]

Jacques Nadeau commented on DRILL-19:

Interesting.  Here are some thoughts:

- I'm not seeing the entry point into the protoschema generation.  How does one try this out?
- It seems less important to actually generate the proto text. The goal is a canonical form
of schema that we use for all our data sources. I'd really like to leverage an existing format.
I'm inclined towards a protobuf derivative since that is what HDFS and HBase went with.
- Possible approaches are to use Google's DescriptorProtos/Descriptors as the target output.
Another option is to leverage what Protostuff has already put together. What would be nice about
leveraging their work is that we get conversion of proto definitions to our schema data format
(an object graph) for free using their built-in compilers/transformers.
- I'm going to start working shortly on the conversion of read objects into a compact in-memory
format based on protobuf. The goal is to also front this in-memory format with a RecordPointer
interface that doesn't realize the data until necessary. Once I build this, I'll
work with you on getting your scanner to output this format (as opposed to a pojo'y one).
- We should morph your iterator interface into one that looks like the one I put together
in RecordIterator in the ref interpreter. I need to add a getSchema() option to mine. The
key thing to note is how getRecordPointer is only called once at setup and we don't do any
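
The setup-once pattern from the last bullet could look roughly like the sketch below. Everything in it is an assumption made for illustration: the interface shapes, the boolean next(), the getField accessor and the ListRecordIterator class are hypothetical and are not the actual RecordIterator or RecordPointer from the ref interpreter. The only point shown is that the caller obtains the pointer once and that each next() changes what that same pointer exposes.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical view over the current record; not Drill's actual RecordPointer.
interface RecordPointer {
    Object getField(String name);
}

// Hypothetical iterator shape: the pointer is fetched once, next() advances it.
interface RecordIterator {
    RecordPointer getRecordPointer();
    boolean next();
}

// Illustrative implementation backed by already-parsed records held in a list.
final class ListRecordIterator implements RecordIterator {
    private final Iterator<Map<String, Object>> source;
    private Map<String, Object> current; // record the pointer currently exposes

    // One pointer instance for the iterator's lifetime; it reads through to
    // 'current', so callers never need to ask for a new pointer per record.
    private final RecordPointer pointer =
            name -> current == null ? null : current.get(name);

    ListRecordIterator(List<Map<String, Object>> records) {
        this.source = records.iterator();
    }

    @Override
    public RecordPointer getRecordPointer() {
        return pointer;
    }

    @Override
    public boolean next() {
        if (!source.hasNext()) {
            current = null; // exhausted: the pointer now exposes nothing
            return false;
        }
        current = source.next();
        return true;
    }
}
```

A consumer would call getRecordPointer() during setup, then loop on next() and read fields through the pointer it already holds.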
> Build a JSON scanner that does schema discovery
> -----------------------------------------------
>                 Key: DRILL-19
>                 URL: https://issues.apache.org/jira/browse/DRILL-19
>             Project: Apache Drill
>          Issue Type: New Feature
>            Reporter: Jacques Nadeau
>            Assignee: Timothy Chen
> Build a JSON scanner that reads a file and converts it into two parts: a stream of records
> and a schema which reflects the schema of the records.
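
As a rough illustration of the "stream of records plus a schema" idea in the description, the sketch below walks records that are assumed to have already been parsed into maps by some JSON library and collects the set of value types seen for each top-level field. The SchemaDiscovery class, its method names and the type labels are hypothetical; the schema here is a plain map rather than the protobuf-derived form discussed in the comment, and nested fields are not descended into.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: infers a flat schema from already-parsed JSON records.
final class SchemaDiscovery {

    // Maps a parsed JSON value to a coarse type label (labels are illustrative).
    static String typeOf(Object value) {
        if (value == null) return "null";
        if (value instanceof Boolean) return "boolean";
        if (value instanceof Number) return "number";
        if (value instanceof String) return "string";
        if (value instanceof List) return "array";
        if (value instanceof Map) return "object";
        return "unknown";
    }

    // Returns field name -> every type label observed for that field, in
    // first-seen field order. A field absent from some records simply
    // contributes nothing for those records.
    static Map<String, Set<String>> discover(Iterable<Map<String, Object>> records) {
        Map<String, Set<String>> schema = new LinkedHashMap<>();
        for (Map<String, Object> record : records) {
            for (Map.Entry<String, Object> field : record.entrySet()) {
                schema.computeIfAbsent(field.getKey(), k -> new TreeSet<>())
                      .add(typeOf(field.getValue()));
            }
        }
        return schema;
    }
}
```

A field that appears as a number in one record and a string in another ends up with both labels, which is the kind of conflict a discovered schema would need to represent.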

