calcite-dev mailing list archives

From Julian Hyde <jh...@apache.org>
Subject Re: Embed druid-sql inside Calcite?
Date Wed, 07 Feb 2018 23:46:07 GMT
Long term there doesn’t seem to be any point keeping Calcite’s druid adapter around. The
code would be an inferior duplicate of druid-sql, so we would want to 

But shorter term there will be quite a few things that Hive needs that will only exist in
Calcite’s druid adapter. The challenge will be the transition. You will need to convince
the Hive developers that the move is worthwhile. (It will help if you can point to some quick
benefits to making the transition.)

Julian


> On Feb 7, 2018, at 2:59 PM, Gian Merlino <gian@imply.io> wrote:
> 
> In the world where druid-sql is where Druid's Calcite API lives, what do
> you think would make the most sense for the current calcite-druid module?
> Would it make sense to remove it (and merge anything it does, that
> druid-sql doesn't already do, into druid-sql) or to keep it in the Calcite
> project but have it be a thin wrapper over druid-sql?
> 
> I guess this should be informed by who the users of calcite-druid are. At
> this point, I don't know much beyond the fact that Hive uses it.
> 
> Gian
> 
> On Wed, Feb 7, 2018 at 10:29 AM, Julian Hyde <jhyde@apache.org> wrote:
> 
>> I agree with you both.
>> 
>> For a particular engine, such as Druid, there are often 3 options:
>> 
>> 1. build a Calcite adapter to the engine's native query language;
>> 
>> 2. if the engine supports SQL, connect to the engine via Calcite's JDBC
>> adapter;
>> 
>> 3. if the engine exposes an API based on Calcite algebra, connect to that
>> API.
>> 
>> All of those options are valid for Druid right now, and 3 (Gian's
>> proposal) is likely to yield the best plans. As Gian correctly notes,
>> that is likely to increase the coupling, but we can live with that.
>> (If people want loose coupling they can talk to Druid via the JDBC
>> adapter, and we just need to make sure that the Druid JDBC dialect
>> knows that Druid cannot do joins.)
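
The loose-coupling caveat above can be illustrated with a toy model. This is a minimal sketch, not Calcite's real dialect API: a dialect declares which relational operators the backend can execute, and the planner pushes down only the supported lower part of a plan. All names here (`Dialect`, `split_plan`, the operator strings) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dialect:
    """Hypothetical stand-in for a SQL dialect's capability flags."""
    name: str
    supported_ops: frozenset


def split_plan(plan, dialect):
    """Split a bottom-up list of operators into (pushed_down, kept_local).

    Operators are pushed to the backend until the first unsupported one;
    that operator and everything above it stay in the calling engine.
    """
    for i, op in enumerate(plan):
        if op not in dialect.supported_ops:
            return plan[:i], plan[i:]
    return plan, []


# Assumption for illustration: scans, filters and aggregates are supported,
# joins are not (the thread only states that Druid cannot do joins).
druid = Dialect("druid", frozenset({"scan", "filter", "aggregate"}))

pushed, local = split_plan(["scan", "filter", "join", "aggregate"], druid)
print(pushed)  # ['scan', 'filter']
print(local)   # ['join', 'aggregate']
```

In this model the join and everything above it run in the calling engine, which is the behaviour the dialect needs to guarantee.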
>> 
>> Nishant's core point seems to be that we need some kind of bulk
>> API/protocol to talk to Druid, to consume partial query results in
>> parallel. This is desirable because Hive is -- how to put it
>> politely?! -- a "bigger" query engine. I'm sure that Spark, Presto and
>> Drill would want a similar API/protocol. When it exists, we can
>> generate a hybrid plan: Druid physical algebra that generates partial
>> results in parallel underneath Hive physical algebra that consumes
>> those results in parallel.
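
The shape of such a hybrid plan can be sketched as follows. This is only an illustration under stated assumptions, not an existing Druid or Hive API: the segment data, `partial_aggregate`, `merge_partials` and `run_hybrid_plan` are all hypothetical names, and the final merge is done serially for brevity, whereas a real engine would also parallelize the consuming side.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands for rows held by one Druid node.
SEGMENTS = [
    [("us", 3), ("eu", 1)],
    [("us", 2), ("asia", 5)],
    [("eu", 4)],
]


def partial_aggregate(rows):
    """Lower (Druid-side) operator: sum counts per key within one segment."""
    totals = {}
    for key, count in rows:
        totals[key] = totals.get(key, 0) + count
    return totals


def merge_partials(partials):
    """Upper (Hive-side) operator: combine partial sums into final totals."""
    final = {}
    for partial in partials:
        for key, count in partial.items():
            final[key] = final.get(key, 0) + count
    return final


def run_hybrid_plan(segments, workers=3):
    # Each segment is aggregated by its own worker thread; pool.map()
    # returns the partial results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_aggregate, segments))
    return merge_partials(partials)


result = run_hybrid_plan(SEGMENTS)
print(result)  # {'us': 5, 'eu': 5, 'asia': 5}
```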
>> 
>> The same pattern occurred in Phoenix. Phoenix does not have
>> shuffle/exchange capabilities, so for big analytic queries we would
>> want to couple it with Hive/Spark/Presto/Drill. We talked about
>> Drillix (Drill + Phoenix) for a while but never completed it.
>> 
>> Julian
>> 
>> 
>> On Wed, Feb 7, 2018 at 9:07 AM, Nishant Bangarwa
>> <nishant.monu51@gmail.com> wrote:
>>> Having a focused effort on a single project would be great and would
>>> definitely help us in evolving druid-sql capabilities faster.
>>> 
>>> 1) One more thing that we need to consider here is that the calcite
>>> druid-adapter is also used in Apache Hive, where we use the druid rules
>>> to generate an optimized plan and then the druid query is executed from
>>> druid containers. In druid-sql I believe the query execution logic is
>>> tied to the fact that the execution node is a druid-broker where native
>>> queries can be run to generate a Sequence of results. We might need some
>>> rework there to ensure that things work fine with Hive too after the
>>> proposed changes.
>>> 
>>> 2) druid-sql dependencies can probably be reduced by separating the
>>> planning and execution logic in druid-sql. The planning logic need not
>>> depend on lots of druid code and can have light-weight dependencies,
>>> while the execution part and result serde, which pull in lots of druid
>>> dependencies, can reside in a separate module that the calcite
>>> druid-adapter need not depend on.
>>> 
>>> I think the hypothetical case you mentioned is also worth considering.
>>> To ease the development process, we can consider moving calcite-druid
>>> into druid as a module, so that we release druid-sql and the
>>> calcite-adapter together.
>>> 
>>> On Wed, 7 Feb 2018 at 09:02 Gian Merlino <gian@imply.io> wrote:
>>> 
>>>> Hi Calcites,
>>>> 
>>>> I would like to raise the idea of adding druid-sql (
>>>> http://search.maven.org/#artifactdetails%7Cio.druid%7Cdruid-sql%7C0.11.0%7Cjar
>>>> )
>>>> as a dependency in Calcite's Druid adapter. It should reduce the size of
>>>> calcite-druid substantially, since it would mostly just be calling into
>>>> druid-sql.
>>>> 
>>>> This has some advantages for both projects.
>>>> 
>>>> 1) Support for new Druid features often appears in Druid SQL first. By
>>>> embedding druid-sql, Calcite gets these new features too, without extra
>>>> work. For example https://issues.apache.org/jira/browse/CALCITE-2170 is
>>>> an outstanding jira to add support for Druid expressions to Calcite, but
>>>> druid-sql already supports these. In fact it looks like some of the code
>>>> in the proposed patch is copied from druid-sql. As another example,
>>>> https://issues.apache.org/jira/browse/CALCITE-2077 switched table scans
>>>> from "select" to "scan", which had been previously done in Druid SQL in
>>>> https://github.com/druid-io/druid/pull/4751.
>>>> 
>>>> 2) Depending on druid-sql means Calcite doesn't need to implement its
>>>> own Druid query and result serde code. Druid already has it.
>>>> 
>>>> 3) Focused effort on a single module rather than the split effort that
>>>> we have today, where some developers are contributing to druid-sql and
>>>> some are contributing to calcite-druid.
>>>> 
>>>> 4) More test coverage for both projects, presumably.
>>>> 
>>>> I think (3) and (4) especially would give us the opportunity to improve
>>>> both projects much more rapidly.
>>>> 
>>>> However, there are also some possible disadvantages.
>>>> 
>>>> 1) druid-sql is a somewhat heavyweight module. It pulls in a lot of
>>>> other Druid code. Calcite users may prefer a lighter weight module.
>>>> 
>>>> 2) druid-sql's APIs are not intended to be stable, and probably never
>>>> will be. They may break on minor releases. So updating the version of
>>>> druid-sql in Calcite may involve tweaking how functions are called, etc.
>>>> I think this effort should be minimal if calcite-druid is mostly just
>>>> delegating to druid-sql.
>>>> 
>>>> 3) druid-sql depends on calcite-core. This should usually be fine, but
>>>> it means that if calcite-core has a breaking change, then calcite-druid
>>>> cannot update its version of druid-sql until druid-sql first updates its
>>>> version of calcite-core.
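
The version-coupling constraint in (3) can be made concrete with a small check. This is a sketch under stated assumptions, not a real build rule: the function names are hypothetical, the version numbers in the usage lines are made up, and "compatible" is simplified to mean that both modules sit on the same side of the breaking calcite-core release.

```python
def parse(version):
    """Turn a dotted version such as '1.15.0' into (1, 15, 0)."""
    return tuple(int(part) for part in version.split("."))


def can_upgrade_druid_sql(calcite_core, druid_sql_builds_against, breaking_since):
    """Return True if calcite-druid can adopt a given druid-sql release.

    calcite_core: calcite-core version that calcite-druid ships with.
    druid_sql_builds_against: calcite-core version druid-sql was built against.
    breaking_since: calcite-core version that introduced the breaking change.
    """
    ours_is_post_break = parse(calcite_core) >= parse(breaking_since)
    theirs_is_post_break = parse(druid_sql_builds_against) >= parse(breaking_since)
    # Both sides must sit on the same side of the breaking change.
    return ours_is_post_break == theirs_is_post_break


# Made-up versions: calcite-core broke an API in 1.16.0, druid-sql is still on 1.15.0.
blocked = can_upgrade_druid_sql("1.16.0", "1.15.0", "1.16.0")
# After druid-sql moves to 1.16.0 the upgrade becomes possible.
unblocked = can_upgrade_druid_sql("1.16.0", "1.16.0", "1.16.0")
print(blocked, unblocked)  # False True
```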
>>>> 
>>>> Despite these potential difficulties, I think the potential benefit
>>>> means this is worth exploring.
>>>> 
>>>> Finally: a hypothetical. Why not do the other way around -- have Druid
>>>> add calcite-druid as a dependency? The main reason is that this makes
>>>> the Druid development process awkward when a new Druid SQL feature also
>>>> requires a new native query feature. Today, we develop the native query
>>>> and SQL sides together. If Druid depended on calcite-druid, then we
>>>> would need to develop the native query side first, then release it, then
>>>> update Calcite's Druid adapter, then pull that back into Druid.
>>>> Generally, just adding an extra rule in druid-sql wouldn't be enough,
>>>> since the sorts of changes we are making at this point are typically
>>>> more extensive than just adjusting rules.
>>>> 
>>>> Gian
>>>> 
>> 

