drill-dev mailing list archives

From Camuel Gilyadov <cam...@gmail.com>
Subject schema language for Drill
Date Sun, 14 Oct 2012 19:00:01 GMT
We need to settle on a schema language for the project. I know there is
a lot of goodwill toward supporting many schema languages and data formats,
and also the schemaless use case. No problem with that. But let's go modular
by creating many compact single-purpose components, which could then be
connected in various combinations to produce many different useful systems,
each optimized for a different scenario. And not all components must be used
in all combinations, of course. However, I suggest first producing a
complete path optimized for one scenario and only then extending it by
adding new components.

So in this context let's settle on one schema language and one data
format. I strongly oppose starting from the "schemaless" use case,
like drilling a loose set of JSON documents. The reason is that
"schemaless" datasets in fact contain far too much schema information
within the dataset proper. Worse, it is only a partial schema, with
the other parts of the schema loosely scattered across parsers, often in
an ugly, hard-coded imperative form. This goes directly against the Dremel
approach, which goes so far as to encode all data in columnar form and then
compress it in such a way that predicate evaluation can be done before
decompression, and so on. We can add the much-needed "drilling a loose pile
of JSON documents" use case later, but since it is not a typical use case,
let's not start the project from it. The irony here is that a truly
schemaless dataset is one that has its schema supplied separately and can
therefore afford to contain zero schema information within the dataset proper.
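
To make the point about schema information living inside the dataset
concrete, here is a minimal sketch. The records and field names are
invented for illustration and do not come from any real Drill dataset. It
contrasts a self-describing JSON encoding, where every record repeats its
field names, with an encoding where the field list is supplied once and
the dataset proper holds values only.

```python
import json

# Hypothetical records, invented purely for illustration.
records = [
    {"doc_id": 10, "name": "alpha", "score": 3},
    {"doc_id": 20, "name": "beta", "score": 7},
]

# "Schemaless" form: every record carries its own field names.
self_describing = json.dumps(records)

# Schema supplied separately: field names are stated once, outside the data,
# and the dataset proper contains bare values in schema order.
schema = ["doc_id", "name", "score"]
values_only = json.dumps([[rec[field] for field in schema] for rec in records])


def restore(schema, rows):
    """Rebuild the original records from the external schema plus bare values."""
    return [dict(zip(schema, row)) for row in rows]


restored = restore(schema, json.loads(values_only))
```

The values-only form contains no field names at all, yet `restore`
recovers the same records once the schema is supplied from outside.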

So, back to the schema language. I suggest sticking to .proto files and
having protobuf as the initial "standard" data format. Drill would
initially become a query system for protobuf data. The schema would be
expressed in .proto format, DrQL queries would be validated against it, and
a .proto schema would be supplied for each result. The same would apply to
internal data interchange. A .proto schema would initially have two
encodings: the usual binary hierarchical one and the "Dremel" binary
columnar one. With either encoding it is exactly the same schema, and data
could be converted between encodings without loss.
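
As a rough illustration of "one schema, two encodings", the sketch below
uses a hypothetical flat schema (a plain Python dict standing in for a
.proto message; all names are assumptions). It validates a projection
against that schema, derives the result's schema, and converts records
between a record-oriented and a column-oriented layout. It deliberately
ignores nested and repeated fields, which the real Dremel columnar
encoding handles with repetition and definition levels, and it does not
use the protobuf wire format.

```python
# Hypothetical flat schema standing in for a .proto message definition.
# Field names and types are invented for illustration.
SCHEMA = {"doc_id": int, "name": str, "score": float}


def result_schema(selected_fields):
    """Validate a projection against SCHEMA and return the result's schema."""
    unknown = [f for f in selected_fields if f not in SCHEMA]
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    return {f: SCHEMA[f] for f in selected_fields}


def to_columns(rows):
    """Record-oriented layout -> one list of values per schema field."""
    return {field: [row[field] for row in rows] for field in SCHEMA}


def to_rows(columns):
    """Column-oriented layout -> records; the inverse of to_columns."""
    length = len(columns[next(iter(SCHEMA))])
    return [{f: columns[f][i] for f in SCHEMA} for i in range(length)]


rows = [
    {"doc_id": 10, "name": "alpha", "score": 0.5},
    {"doc_id": 20, "name": "beta", "score": 1.5},
]
columns = to_columns(rows)
```

Because both layouts are driven by the same `SCHEMA`, `to_rows(columns)`
gives back the original records, and a projection that names a field
outside the schema is rejected before any data is touched.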

If not protobuf, then we have several other formats that support the concept
of a separate schema, like Avro, Thrift, or oldies like XML and ASN.1. I am
more familiar with protobuf and Avro, and between these two I
strongly favor protobuf (OpenDremel uses Avro, and while it worked great, its
schema language IS WAY TOO CRYPTIC, and this is the only reason
I disfavor Avro). I don't know Thrift, and XML and ASN.1 are so uncool now :)
that no one would bother. So from what I know, I strongly suggest protobuf.

As I said, it is not a life-or-death question, just a question of which
format we start coding from... and therefore team experience does count.

What do you think?
