calcite-dev mailing list archives

From: Julian Hyde <jh...@apache.org>
Subject: Re: Embed druid-sql inside Calcite?
Date: Wed, 07 Feb 2018 18:29:41 GMT

I agree with you both.

For a particular engine, such as Druid, there are often 3 options:

1. build a Calcite adapter to the engine's native query language;

2. if the engine supports SQL, connect to the engine via Calcite's JDBC adapter;

3. if the engine exposes an API based on Calcite algebra, connect to that API.

All of those options are valid for Druid right now, and 3 (Gian's
proposal) is likely to yield the best plans. As Gian correctly notes,
that is likely to increase the coupling, but we can live with that.
(If people want loose coupling they can talk to Druid via the JDBC
adapter, and we just need to make sure that the Druid JDBC dialect
knows that Druid cannot do joins.)

Nishant's core point seems to be that we need some kind of bulk
API/protocol to talk to Druid, to consume partial query results in
parallel. This is desirable because Hive is -- how to put it
politely?! -- a "bigger" query engine. I'm sure that Spark, Presto and
Drill would want a similar API/protocol. When it exists, we can
generate a hybrid plan: Druid physical algebra that generates partial
results in parallel underneath Hive physical algebra that consumes
those results in parallel.
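
[Editor's illustration] The hybrid plan described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not a real Druid or Hive API: the segment data, the function names (partial_aggregate, merge_partials, run_hybrid_plan) and the thread pool standing in for Druid data nodes are all hypothetical. The inner stage produces partial aggregates in parallel, and the outer stage merges them. For brevity the merge is single-threaded, whereas a real outer engine would also consume in parallel.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands for the rows that one
# Druid node would scan for the query. Names and values are made up.
SEGMENTS = [
    [("us", 3), ("fr", 1)],
    [("us", 2), ("de", 5)],
    [("fr", 4), ("de", 1)],
]


def partial_aggregate(rows):
    """Inner ("Druid") stage: sum clicks per country for one segment."""
    totals = Counter()
    for country, clicks in rows:
        totals[country] += clicks
    return totals


def merge_partials(partials):
    """Outer ("Hive") stage: combine partial sums into the final result."""
    final = Counter()
    for partial in partials:
        final.update(partial)  # Counter.update adds counts key by key
    return dict(final)


def run_hybrid_plan(segments, workers=3):
    """Run the inner stage concurrently, then merge in the outer stage."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map submits one partial aggregation per segment
        return merge_partials(pool.map(partial_aggregate, segments))


RESULT = run_hybrid_plan(SEGMENTS)
```

The point of the split is that only the small partial aggregates cross the boundary between the two engines, rather than the raw rows.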

The same pattern occurred in Phoenix. Phoenix does not have
shuffle/exchange capabilities, so for big analytic queries we would
want to couple it with Hive/Spark/Presto/Drill. We talked about
Drillix (Drill + Phoenix) for a while but never completed it.

Julian


On Wed, Feb 7, 2018 at 9:07 AM, Nishant Bangarwa
<nishant.monu51@gmail.com> wrote:
> Focusing effort on a single project would be great and would
> definitely help us evolve druid-sql capabilities faster.
>
> 1) One more thing we need to consider here is that the calcite
> druid-adapter is also used in Apache Hive, where we use the druid rules to
> generate an optimized plan and then the druid query is executed from druid
> containers. In druid-sql, I believe the query execution logic is tied to the
> fact that the execution node is a druid-broker, where native queries can be
> run to generate a Sequence of results. We might need some rework there to
> ensure that things work fine with Hive too after the proposed changes.
>
> 2) druid-sql dependencies can probably be reduced by separating the
> planning and execution logic in druid-sql. The planning logic need not
> depend on lots of druid code and can have lightweight dependencies, while
> the execution part and result serde, which pull in lots of druid
> dependencies, can reside in a separate module that the calcite
> druid-adapter need not depend on.
>
> I think the hypothetical case you mentioned is also worth considering. To
> ease the development process, we could move calcite-druid into druid as a
> module, so that druid-sql and the calcite adapter are released
> together.
>
> On Wed, 7 Feb 2018 at 09:02 Gian Merlino <gian@imply.io> wrote:
>
>> Hi Calcites,
>>
>> I would like to raise the idea of adding druid-sql (
>>
>> http://search.maven.org/#artifactdetails%7Cio.druid%7Cdruid-sql%7C0.11.0%7Cjar
>> )
>> as a dependency in Calcite's Druid adapter. It should reduce the size of
>> calcite-druid substantially, since it would mostly just be calling into
>> druid-sql.
>>
>> This has some advantages for both projects.
>>
>> 1) Support for new Druid features often appears in Druid SQL first. By
>> embedding druid-sql, Calcite gets these new features too, without extra
>> work. For example https://issues.apache.org/jira/browse/CALCITE-2170 is an
>> outstanding jira to add support for Druid expressions to Calcite, but
>> druid-sql already supports these. In fact it looks like some of the code in
>> the proposed patch is copied from druid-sql. As another example,
>> https://issues.apache.org/jira/browse/CALCITE-2077 switched table scans
>> from "select" to "scan", which had been previously done in Druid SQL in
>> https://github.com/druid-io/druid/pull/4751.
>>
>> 2) Depending on druid-sql means Calcite doesn't need to implement its own
>> Druid query and result serde code. Druid already has it.
>>
>> 3) Focused effort on a single module rather than the split effort that we
>> have today, where some developers are contributing to druid-sql and some
>> are contributing to calcite-druid.
>>
>> 4) More test coverage for both projects, presumably.
>>
>> I think (3) and (4) especially would give us the opportunity to improve
>> both projects much more rapidly.
>>
>> However, there are also some possible disadvantages.
>>
>> 1) druid-sql is a somewhat heavyweight module. It pulls in a lot of other
>> Druid code. Calcite users may prefer a lighter weight module.
>>
>> 2) druid-sql's APIs are not intended to be stable, and probably never will
>> be. They may break on minor releases. So updating the version of druid-sql
>> in Calcite may involve tweaking how functions are called, etc. I think this
>> effort should be minimal if calcite-druid is mostly just delegating to
>> druid-sql.
>>
>> 3) druid-sql depends on calcite-core. This should usually be fine, but it
>> means that if calcite-core has a breaking change, then calcite-druid cannot
>> update its version of druid-sql until druid-sql first updates its version
>> of calcite-core.
>>
>> Despite these potential difficulties, I think the potential benefit means
>> this is worth exploring.
>>
>> Finally: a hypothetical. Why not do the other way around -- have Druid add
>> calcite-druid as a dependency? The main reason is that this makes the Druid
>> development process awkward when a new Druid SQL feature also requires a
>> new native query feature. Today, we develop the native query and SQL sides
>> together. If Druid depended on calcite-druid, then we would need to develop
>> the native query side first, then release it, then update Calcite's Druid
>> adapter, then pull that back into Druid. Generally, just adding an extra
>> rule in druid-sql wouldn't be enough, since the sorts of changes we are
>> making at this point are typically more extensive than just adjusting
>> rules.
>>
>> Gian
>>
