drill-dev mailing list archives

From Jason Frantz <jfra...@maprtech.com>
Subject Re: schema language for Drill
Date Mon, 15 Oct 2012 14:37:21 GMT
I agree with Camuel that, compared to querying JSON, querying a columnar
Dremel-like format will be significantly faster. Also, a lot of
"schemaless" data has an implicit schema, so supplying the schema out of
band can reduce the processing overhead (this is what it looks like
BigQuery recently did to handle JSON).
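
To illustrate the implicit-schema point, here is a minimal Python sketch that scans a set of JSON documents once and records which types each top-level field takes. The function name infer_schema and the sample documents are hypothetical; this is not how Drill or BigQuery does it, only a picture of the schema that is already hiding in "schemaless" data.

```python
import json

def infer_schema(lines):
    """Map each top-level field to the set of Python type names seen for it."""
    schema = {}
    for line in lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

# Hypothetical sample documents, one JSON object per line.
docs = ['{"id": 1, "name": "a"}', '{"id": 2, "name": "b", "tags": ["x"]}']
schema = infer_schema(docs)
```

Once such a schema is known, it can be handed to the reader up front instead of being rediscovered for every record.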

That said, I think there are two big benefits to starting out by tackling
JSON. First, JSON is the easiest to integrate with existing data sets,
since there's no storage format to convert to. Second, I think JSON
exercises a wider set of issues than data with a well-defined schema, so
it will be much harder to adapt a protobuf-based system to handle JSON
than the reverse. For example, using something like LLVM to compile JSON
processing code is a fairly poor fit, since processing every value needs a
large switch to handle all the potential types.
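
The "large switch" can be sketched in a few lines of Python. The function names and the sample records below are made up for illustration, and the choice to coerce numeric strings is just one possible policy. The point is only that the schemaless path branches on the type of every value, while the typed path has nothing to branch on.

```python
import json

def sum_field_schemaless(lines, field):
    """Sum a field whose type is unknown until each value is inspected."""
    total = 0.0
    for line in lines:
        value = json.loads(line).get(field)
        # Every value needs a runtime type check: the "large switch".
        if isinstance(value, bool):
            continue  # bool is a subclass of int in Python, so skip it first
        elif isinstance(value, (int, float)):
            total += value
        elif isinstance(value, str):
            try:
                total += float(value)  # assumed policy: coerce numeric strings
            except ValueError:
                pass
        # null, arrays, and objects contribute nothing
    return total

def sum_field_typed(values):
    """With a schema declaring the column numeric, no per-value dispatch."""
    return float(sum(values))

# Hypothetical records showing the variety a JSON reader must tolerate.
records = ['{"n": 1}', '{"n": "2.5"}', '{"n": null}',
           '{"n": [3]}', '{"n": true}', '{}']
```

A compiler can turn the typed loop into tight code, but the schemaless loop keeps its branches no matter how it is compiled.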

If other people would prefer to go down the schema route first, I agree
with Camuel about starting with protobuf and adopting two formats: a binary
row-based format and a binary Dremel-like columnar format.
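
For flat records, the relationship between the two formats can be sketched as below. This is a simplified illustration with hypothetical function names: it ignores nested and repeated fields, which the real Dremel columnar encoding handles with repetition and definition levels, and it assumes every row has every field.

```python
def rows_to_columns(rows, fields):
    """Pivot a list of flat records into one list of values per field."""
    return {f: [row[f] for row in rows] for f in fields}

def columns_to_rows(columns):
    """Pivot per-field value lists back into a list of flat records."""
    fields = list(columns)
    length = len(columns[fields[0]]) if fields else 0
    return [{f: columns[f][i] for f in fields} for i in range(length)]

# Hypothetical flat records sharing one schema: id (int), name (str).
rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
columns = rows_to_columns(rows, ["id", "name"])
```

A query that touches only "id" reads one list in the columnar layout, while the row layout has to walk every record.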


On Sun, Oct 14, 2012 at 12:00 PM, Camuel Gilyadov <camuel@gmail.com> wrote:

> We need to settle on a schema language for the project. I know there is
> a lot of goodwill toward supporting many schema languages and data formats,
> as well as the schemaless use case. No problem with that. But let's go
> modular by creating many compact single-purpose components that can then be
> connected in various combinations to produce many different useful systems,
> each optimized for a different scenario. Not all components need to be used
> in all combinations, of course. However, I suggest first producing a
> complete path optimized for one scenario and only then extending it by
> adding new components.
> So in this context let's settle on one schema language and one data
> format. I strongly oppose starting from the "schemaless" use case,
> like drilling a loose set of JSON documents. The reason is that
> "schemaless" datasets in fact contain way too much schema information
> within the dataset proper. Worse, it is a partial schema, with
> the other parts of the schema loosely scattered across parsers, often in
> ugly hard-coded imperative form. This is just so against the Dremel approach,
> which goes so far as to encode all data in columnar form and then compress
> it in such a way that predicate evaluation can be done before decompression,
> and so on. We can add the much-needed "drilling a loose pile of JSON
> documents" use case later, but since it is not a typical use case let's not
> start the project from it. The irony here is that truly schemaless datasets
> are those that have the schema supplied separately and can therefore afford
> to contain zero schema information within the dataset proper.
> So back to the schema language. I suggest sticking to .proto files and
> having protobuf as the initial "standard" data format. Drill would
> initially become a query system for protobuf data. The schema would be
> expressed in .proto format, DrQL queries would be validated against it, and
> a .proto schema would be supplied for each result. This would also be the
> case for internal data interchange. A .proto schema would initially have two
> encodings: the usual binary hierarchical one and the "Dremel" binary
> columnar one. With either encoding it is exactly the same schema, and data
> could be converted between encodings without loss.
> If not protobuf, then we have several other formats that support the concept
> of a separate schema, like Avro, Thrift, or oldies like XML and ASN.1. I am
> more familiar with protobuf and Avro, and of these two I
> strongly favor protobuf (OpenDremel uses Avro and while it worked great, its
> schema language IS WAY TOO CRYPTIC, and this is the only reason
> I disfavor Avro). I don't know Thrift, and XML and ASN.1 are so uncool now :)
> that no one would bother. So from what I know I strongly suggest protobuf.
> As I said, it is not a life-or-death question, just a question of which
> format we start coding with... and therefore team experience does count.
> What do you think?
