drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Hausenblas <michael.hausenb...@gmail.com>
Subject Fwd: Apache Drill and Schema
Date Sat, 11 May 2013 18:43:21 GMT

After a recent Apache Drill talk in Berlin, one of the people in the audience, Gunnar (in
CC) raised some very interesting schema-related challenges. He was so kind to write up his
questions, see below. I hope we do, as a community, a good job addressing these.

Again, thanks a lot for your interest in Drill, Gunnar, and hope you'll find time to test
drive it once it's available in all its beauty! 


Michael Hausenblas
Ireland, Europe

Begin forwarded message:

> From: Gunnar Schröder <gunnar.schroeder@tu-dresden.de>
> Subject: Apache Drill and Schema
> Date: 9 May 2013 22:15:59 GMT+01:00
> Hi Michael,
> as I promised I e-mail you some questions regarding Apache Drill, that crossed my mind
when hearing your talk. Since you might want to forward this e-mail or parts of it to the
mailing list, I will write you in English. :-)
> The question that puzzled me most after hearing your talk was how Apache Drill handles
schema information about the data in any of its sources.
> RDBMSs for example have a very strict schema definition (Schema on write). Hive and Impala
use the metastore to define a very similar schema definition, that must be available before
querying data (Schema on Read). This schema is necessary for Hive to query HBase which has
a more flexible schema, e.g. dynamic columns and no explicit types for data.
> I have problems understanding, how you want to implement SQL as query language (or any
other high level query language) if Schema information is not or only partly available? If
you reuse the schema information of Hive in its metastore or relational database such as MySQL
you are fine. But how do you handle a raw CSV file or HBase without Hive schema information?
> Some simple examples:
> Select * from table order by col1;
> If the table is a CSV file or HBase table, how do you determine whether the ordering
is done using string sort order or numeric order?
> Select * from table;
> I'm no expert on this, but I assume that if you query a table using JDBC or ODBC you
need full schema information before you obtain the first result of the query.
> Select * from table1, table2 where table1.col1 = table2.col2;
> Let's assume the column to join on is a string. Some storage engines may make a distinction
between null and empty string. Some storage engines may be non first normal form and may have
multiple strings.
> These are just some examples, but I have doubts, whether it is possible to implement
SQL 2003 as query language without full schema information.
> I scrolled through the Apache Drill design document and it mentions a metadata repository
but the information regarding schema and typing is very sketchy.
> My recommendation would be:
> Start out by reusing the Hive metadata repository and implement Apache Drill on top of
this schema information. Then gradually replace or supplement this with your own (possibly
more dynamic and flexible) metadata repository.
> To be successful:
> - Keep it Apache Open Source Licence
> - Your major competitor is Impala, so think about in which respect you can be better
> - SQL and JDBC is very important to be successful (interface to DBs and reporting tools)
> - HBase would be great with a low latency SQL querying facility (different attempts are
> - Provide interfaces to plug in new storage or file format handlers (this is a strength
of Hive, but seems hard to do with Impala), e.g. to access files of your own particular format
within HDFS
> Maybe these random thought about Apache Drill are useful for you. I would love to see
a true open source alternative to Impala, which supplements Hive for real time queries, that
is implemented in Java and provides pluggable interfaces to extend it.
> Cheers,
> Gunnar Schröder

View raw message