drill-dev mailing list archives

From Paul Rogers <prog...@mapr.com>
Subject Thinking about Drill 2.0
Date Mon, 05 Jun 2017 18:59:24 GMT
Hi All,

A while back there was a discussion about the scope of Drill 2.0. That got me thinking about
possible topics. My two cents:

Drill 2.0 should focus on making Drill’s external APIs production ready. This means five
things:

* Clearly identify and define each API.
* (Re)design each API to ensure it fully isolates the client from Drill internals.
* Ensure the API supports version compatibility: old and new clients and servers can be mixed,
within stated limits.
* Fully test each API.
* Fully document each API.
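
As a rough illustration of the version-compatibility bullet, one common approach is for each side to advertise the range of API versions it can speak and to settle on the highest version both support. This is only a sketch under that assumption; `VersionRange` and `negotiate` are hypothetical names, not part of any existing Drill API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VersionRange:
    """Inclusive range of API versions that one side can speak (hypothetical)."""
    min_version: int
    max_version: int


def negotiate(client: VersionRange, server: VersionRange) -> Optional[int]:
    """Return the highest version both sides support, or None if the ranges do not overlap."""
    highest_common = min(client.max_version, server.max_version)
    lowest_common = max(client.min_version, server.min_version)
    return highest_common if highest_common >= lowest_common else None


# An old client (versions 1-2) and a newer server (versions 2-4) overlap at version 2.
old_client = VersionRange(1, 2)
new_server = VersionRange(2, 4)
agreed = negotiate(old_client, new_server)

# A client that is too old falls outside the supported limits and gets no common version.
ancient_client = VersionRange(1, 1)
rejected = negotiate(ancient_client, new_server)
```

The "some limits" in the bullet correspond to the `None` case: the server decides how far back its supported range reaches, and anything older is refused cleanly rather than failing in some undefined way.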

Once client code is isolated from Drill internals, we are free to evolve the internals in
either Drill 2.0 or a later release.

In my mind, the top APIs to revisit are:

* The Drill client API.
* The storage plugin API.

(Explanation below.)

What other APIs should we consider? Here are some examples; please suggest others you know
about:

* Command line scripts and arguments
* REST API
* Names and contents of system tables
* Structure of the storage plugin configuration JSON
* Structure of the query profile
* Structure of the EXPLAIN PLAN output
* Semantics of Drill functions, such as the date functions recently partially fixed by adding
“ANSI” alternatives.
* Naming of config and system/session options.
* (Your suggestions here…)

I’ve taken the liberty of moving some API-breaking tickets in the Apache Drill JIRA to 2.0.
Perhaps we can add others so that we have a good inventory of 2.0 candidates.

Here are the reasons for my two suggestions.

Today, we expose Drill value vectors to the client. This means that if we want to change anything
about Drill’s internal memory format (that is, the value vectors), for example by moving to
Arrow, we break compatibility with old clients. Using value vectors also means we need a very
large percentage of Drill’s internal code on the client, in Java or C++. We are learning that
doing so is a challenge.

A new client API should follow established SQL database tradition: a synchronous, row-based
API designed for versioning, for forward and backward compatibility, and to support ODBC and
JDBC users.
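
To make the row-based idea concrete, here is a minimal sketch of a synchronous cursor that hands the client plain rows while the column-oriented layout stays hidden behind it. Everything here is an assumption for illustration: `RowCursor` is a hypothetical name, and a dict of Python lists stands in for a batch of value vectors.

```python
from typing import Any, Dict, Iterator, List, Tuple


class RowCursor:
    """Synchronous, row-at-a-time view over column-oriented batches (hypothetical)."""

    def __init__(self, batches: List[Dict[str, List[Any]]]):
        # Each batch maps a column name to that column's values.
        self._batches = batches

    def __iter__(self) -> Iterator[Tuple[Any, ...]]:
        for batch in self._batches:
            columns = list(batch.values())
            # zip(*columns) pivots one columnar batch into row tuples.
            for row in zip(*columns):
                yield row


# Two batches as the server might hold them internally (stand-in data).
batches = [
    {"id": [1, 2], "name": ["a", None]},
    {"id": [3], "name": ["c"]},
]
rows = list(RowCursor(batches))
```

The client code only ever sees tuples, so the batch representation behind `RowCursor` could change without the caller noticing.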

We can certainly maintain the existing full, async, heavy-weight client for our tests and
for applications that would benefit from it.

Once we define a new API, we are free to alter Drill’s value vectors to, say, add the needed
null states to fully support JSON, to change offset vectors to not need n+1 values (which
doubles the vector’s allocated size in 64K-row batches), and so on. Once the new client API is
in place, vectors become private to Drill (or Arrow), and we are free to innovate to improve them.
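
The n+1 point can be checked with a little arithmetic. The sketch below assumes 4-byte offsets and an allocator that rounds each buffer up to the next power of two; both are assumptions made for the illustration, and the function names are hypothetical.

```python
OFFSET_WIDTH_BYTES = 4  # assumption: each offset is a 4-byte integer


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def offset_vector_allocation(row_count: int) -> int:
    """Bytes allocated for an offset vector that stores row_count + 1 offsets."""
    needed = (row_count + 1) * OFFSET_WIDTH_BYTES
    # assumption: the allocator rounds every request up to a power of two
    return next_power_of_two(needed)


rows = 64 * 1024                          # a 64K-row batch
exact_fit = rows * OFFSET_WIDTH_BYTES     # what n offsets would need
actual = offset_vector_allocation(rows)   # what n + 1 offsets end up costing
```

Under these assumptions, the single extra offset pushes the request just past a power-of-two boundary, so the allocation is twice what n offsets would need.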

Similarly, the storage plugin API exposes details of Calcite (which seems to evolve with each
new version), exposes value vector implementations, and so on. A cleaner, simpler, more isolated
API will allow storage plugins to be built faster, and will also isolate them from changes to
Drill internals. Without isolation, each change to Drill internals would require plugin authors
to update their plugins before Drill can be released.
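
One way to picture such an isolated plugin API: the plugin implements a small interface that mentions neither planner types nor vector types, and a single engine-side adapter converts its output into the internal columnar form. This is a sketch only; `RowSource`, `InMemorySource`, and `load_columns` are hypothetical names, and Python lists stand in for value vectors.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List


class RowSource(ABC):
    """Hypothetical plugin-facing interface: no planner or vector types appear here."""

    @abstractmethod
    def column_names(self) -> List[str]:
        ...

    @abstractmethod
    def rows(self) -> Iterable[Dict[str, Any]]:
        ...


class InMemorySource(RowSource):
    """Toy plugin that serves rows from a Python list."""

    def __init__(self, data: List[Dict[str, Any]]):
        self._data = data

    def column_names(self) -> List[str]:
        return ["id", "name"]

    def rows(self) -> Iterable[Dict[str, Any]]:
        return iter(self._data)


def load_columns(source: RowSource) -> Dict[str, List[Any]]:
    """Engine-side adapter: the only code that knows the internal columnar layout."""
    columns: Dict[str, List[Any]] = {name: [] for name in source.column_names()}
    for row in source.rows():
        for name in columns:
            columns[name].append(row.get(name))  # missing fields become None
    return columns


result = load_columns(InMemorySource([{"id": 1, "name": "a"}, {"id": 2}]))
```

If the internal layout changes, only `load_columns` needs to change; the plugin itself is untouched.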

Thoughts? Suggestions?

Thanks,

- Paul