drill-dev mailing list archives

From Parth Chandra <par...@apache.org>
Subject Re: Thinking about Drill 2.0
Date Thu, 08 Jun 2017 21:55:38 GMT
Some good work has been done in Arrow to get to a more formalized
representation of complex types (lists and maps) particularly trying to
address the nullability issues. My recommendation would be to get to a
reasonable level of integration with Arrow and then start submitting
changes/patches to Arrow as we need them in Drill. Arrow is moving faster,
being at an earlier stage, so this approach is unlikely to hold us up.
It is also critical that we establish performance baselines before
switching to Arrow. We're hoping for improvement but must guard against
possible regressions.



On Wed, Jun 7, 2017 at 1:53 PM, Julien Le Dem <julien@ledem.net> wrote:

> Hi Paul,
> My 2ct regarding Arrow:
> The goal of Arrow is to be a standard representation that does not break
> compatibility in the future.
> If moving to Arrow is a breaking change, I don’t think it makes sense to
> abstract it out to present a row oriented representation to the client. It
> defeats the purpose.
> You can still use Arrow as your standard representation to the client and
> allow for custom vectors on the server side that get converted before
> sending. This sounds like it could be part of the smaller API you are
> talking about.
> As for backward compatibility with the Drill ValueVectors it is possible
> to make a compatibility layer that patches the few differences (byte
> instead of bits for nullability, some type width difference) with little
> code.
> For changing the offset vectors it would be great to have this discussion
> on the Arrow mailing list so that we don’t diverge. (One simple workaround
> seems to be to use 64K-1 batches?)
> Some work has been done on the Arrow side regarding JSON support (for
> example, maps are now nullable: ARROW-274).
> Cheers
> Julien
>
> > On Jun 5, 2017, at 11:59 AM, Paul Rogers <progers@mapr.com> wrote:
> >
> > Hi All,
> >
> > A while back there was a discussion about the scope of Drill 2.0. Got me
> > thinking about possible topics. My two cents:
> >
> > Drill 2.0 should focus on making Drill’s external APIs production ready.
> > This means five things:
> >
> > * Clearly identify and define each API.
> > * (Re)design each API to ensure it fully isolates the client from Drill
> >   internals.
> > * Ensure the API allows full version compatibility: Allow mixing of
> >   old/new clients and servers with some limits.
> > * Fully test each API.
> > * Fully document each API.
> >
> > Once client code is isolated from Drill internals, we are free to evolve
> > the internals in either Drill 2.0 or a later release.
> >
> > In my mind, the top APIs to revisit are:
> >
> > * The Drill client API.
> > * The storage plugin API.
> >
> > (Explanation below.)
> >
> > What other APIs should we consider? Here are some examples, please
> > suggest items you know about:
> >
> > * Command line scripts and arguments
> > * REST API
> > * Names and contents of system tables
> > * Structure of the storage plugin configuration JSON
> > * Structure of the query profile
> > * Structure of the EXPLAIN PLAN output.
> > * Semantics of Drill functions, such as the date functions recently
> >   partially fixed by adding “ANSI” alternatives.
> > * Naming of config and system/session options.
> > * (Your suggestions here…)
> >
> > I’ve taken the liberty of moving some API-breaking tickets in the Apache
> > Drill JIRA to 2.0. Perhaps we can add others so that we have a good
> > inventory of 2.0 candidates.
> >
> > Here are the reasons for my two suggestions.
> >
> > Today, we expose Drill value vectors to the client. This means if we
> > want to enhance anything about Drill’s internal memory format (i.e. value
> > vectors, such as a possible move to Arrow), we break compatibility with old
> > clients. Using value vectors also means we need a very large percentage of
> > Drill’s internal code on the client in Java or C++. We are learning that
> > doing so is a challenge.
> >
> > A new client API should follow established SQL database tradition: a
> > synchronous, row-based API designed for versioning, for forward and
> > backward compatibility, and to support ODBC and JDBC users.
> >
> > We can certainly maintain the existing full, async, heavy-weight client
> > for our tests and for applications that would benefit from it.
> >
> > Once we define a new API, we are free to alter Drill’s value vectors to,
> > say, add the needed null states to fully support JSON, to change offset
> > vectors to not need n+1 values (which doubles vector size in 64K batches),
> > and so on. Since vectors become private to Drill (or Arrow) after the new
> > client API, we are free to innovate to improve them.
> >
> > Similarly, the storage plugin API exposes details of Calcite (which
> > seems to evolve with each new version), exposes value vector
> > implementations, and so on. A cleaner, simpler, more isolated API will
> > allow storage plugins to be built faster, but will also isolate them from
> > changes to Drill internals. Without isolation, each change to Drill
> > internals would require plugin authors to update their plugin before Drill
> > can be released.
> >
> > Thoughts? Suggestions?
> >
> > Thanks,
> >
> > - Paul
>
>
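
Julien's note about a compatibility layer for nullability ("byte instead of
bits") can be illustrated with a minimal sketch. It assumes the Drill side
stores one byte per value (nonzero meaning "set") and the Arrow side stores a
validity bitmap with one bit per value, least-significant bit first. The class
and method names are hypothetical and are not part of either project's API.

```java
public final class NullabilityCompat {

    // Packs a byte-per-value "is set" array into a bit-per-value bitmap.
    // Assumption: nonzero byte = value present; bit i lives in byte i / 8
    // at position i % 8 (least-significant bit first).
    public static byte[] bytesToBitmap(byte[] isSet) {
        byte[] bitmap = new byte[(isSet.length + 7) / 8];
        for (int i = 0; i < isSet.length; i++) {
            if (isSet[i] != 0) {
                bitmap[i >> 3] |= (byte) (1 << (i & 7));
            }
        }
        return bitmap;
    }

    // Reads a single validity bit back out of the bitmap.
    public static boolean isValid(byte[] bitmap, int index) {
        return (bitmap[index >> 3] & (1 << (index & 7))) != 0;
    }

    public static void main(String[] args) {
        byte[] isSet = {1, 0, 1, 1, 0, 0, 0, 0, 1};
        byte[] bitmap = bytesToBitmap(isSet);
        for (int i = 0; i < isSet.length; i++) {
            System.out.println(i + " valid=" + isValid(bitmap, i));
        }
    }
}
```

Nine one-byte flags shrink to two bitmap bytes here. A real layer would
operate on the underlying buffers rather than Java arrays.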

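The offset-vector point in the thread (n+1 values doubling vector size in 64K
batches, and the 64K-1 workaround) comes down to simple arithmetic. The sketch
below assumes 4-byte offsets and an allocator that rounds every buffer up to
the next power of two. Both are assumptions made for illustration, and the
class and method names are hypothetical.

```java
public final class OffsetVectorSizing {

    // Assumption: each offset entry is a 4-byte integer.
    public static final int OFFSET_WIDTH_BYTES = 4;

    // Rounds n up to the nearest power of two (assumed allocator behavior).
    public static long roundUpToPowerOfTwo(long n) {
        if (n <= 1) {
            return 1;
        }
        long highest = Long.highestOneBit(n);
        return highest == n ? n : highest << 1;
    }

    // An offset vector for valueCount values needs valueCount + 1 entries.
    public static long offsetBufferBytes(int valueCount) {
        long needed = (long) (valueCount + 1) * OFFSET_WIDTH_BYTES;
        return roundUpToPowerOfTwo(needed);
    }

    public static void main(String[] args) {
        // 65536 values -> 65537 offsets -> 262148 bytes -> rounds to 524288
        System.out.println(offsetBufferBytes(65536));
        // 65535 values -> 65536 offsets -> 262144 bytes -> already a power of two
        System.out.println(offsetBufferBytes(65535));
    }
}
```

Under these assumptions, a full 64K batch allocates twice the memory of a
64K-1 batch for its offsets, which is why trimming one row from the batch is
a workable stopgap.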