calcite-dev mailing list archives

From Julian Hyde <jh...@apache.org>
Subject Fwd: [druid-dev] [Proposal] Built-in SQL for Druid
Date Wed, 12 Oct 2016 18:01:18 GMT
FYI - Druid devs are proposing to embed Calcite in Druid, so that Druid has a native SQL interface.

I think Gian’s plan makes sense; its architecture is similar in a lot of ways to how we
are integrating with Phoenix. Calcite’s Druid adapter will still exist, still be useful,
and in fact I expect that Druid devs will end up building on it.

I replied on the thread: https://groups.google.com/forum/?pli=1#!topic/druid-development/3npt9Qxpjr0

Julian

> Begin forwarded message:
> 
> From: Gian Merlino <gian@imply.io>
> Subject: [druid-dev] [Proposal] Built-in SQL for Druid
> Date: October 12, 2016 at 9:50:48 AM PDT
> To: druid-development@googlegroups.com
> Reply-To: druid-development@googlegroups.com
> 
> Inspired by the Calcite Druid adapter (https://groups.google.com/d/topic/druid-development/FK5D162ao74/discussion), I've been playing around with something similar that lives inside the Druid Broker. It seems promising, so in this proposal I'm suggesting we include an official SQL server inside Druid itself.
> 
> I am hoping that we can:
> 
> 1) Use Calcite for SQL parsing and optimizing, and use Avatica (https://calcite.apache.org/docs/avatica_overview.html) for the server and the JDBC client.
> 2) Like the official Calcite Druid adapter, have a set of rules that push down filters, projections, aggregations, sorts, etc. into normal Druid queries (a rough rule sketch follows this list).
> 3) Unlike the official Calcite Druid adapter, use Druid objects (like DimFilter, ExtractionFn, etc.) as model classes, since that avoids extra code, helps with type safety, and speeds up development.
> 4) Have this all run on the Broker, which would then make normal Druid queries to data nodes.
> 5) Work towards being able to push down more and more SQL into normal Druid queries over time.
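> 
> To make (2) concrete, here's a rough sketch, using Calcite's rule API, of the shape such a rule could take. It's only an illustration: DruidFilterRule is a made-up name, and the actual translation into a Druid query is left as a comment, since the Druid-side node type is exactly what's up for discussion.
> 
>   import org.apache.calcite.plan.RelOptRule;
>   import org.apache.calcite.plan.RelOptRuleCall;
>   import org.apache.calcite.rel.core.Filter;
>   import org.apache.calcite.rel.core.TableScan;
> 
>   /** Matches a Filter sitting directly on a table scan, so that the filter
>       can be folded into the native Druid query instead of being evaluated
>       row by row in Calcite. */
>   public class DruidFilterRule extends RelOptRule {
>     public static final DruidFilterRule INSTANCE = new DruidFilterRule();
> 
>     private DruidFilterRule() {
>       // Fire whenever a Filter sits immediately on top of a TableScan.
>       super(operand(Filter.class, operand(TableScan.class, none())));
>     }
> 
>     @Override public void onMatch(RelOptRuleCall call) {
>       final Filter filter = call.rel(0);
>       final TableScan scan = call.rel(1);
>       // A real rule would translate filter.getCondition() (a Calcite
>       // RexNode) into a Druid DimFilter here, then hand call.transformTo()
>       // a single Druid-aware node that replaces both 'filter' and 'scan'.
>     }
>   }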
> 
> Current status
> 
> If people are interested in this proposal then I'll clean up the code a bit and do a PR. Currently it's a rough prototype. Some things that do work:
> 
> 1) Avatica handler running at /druid/v2/sql/ + Avatica JDBC driver (see the connection sketch after this list)
> 2) Determining column types with segment metadata queries
> 3) Pushing down operator sequences that look like filter -> project -> aggregate -> project -> sort into groupBy, timeseries, and select queries as appropriate
> 4) Using "intervals" to filter on time when appropriate
> 5) LIKE, range, equality, and boolean filters
> 6) SUM, MIN, MAX, AVG, COUNT, COUNT DISTINCT
> 7) Some extraction fns like SUBSTRING, CHAR_LENGTH
> 8) GROUP BY FLOOR(__time TO gran) for time-series
> 9) Arithmetic post-aggregations
> 10) Filtered aggregations using CASE or using FILTER(WHERE ...)
> 11) Semi-joins like SELECT ... WHERE xxx IN (SELECT ...) can run by materializing the inner result on the Broker and applying it to the outer query as a filter. Obviously this doesn't always work, but it works sometimes (and it works more often than pulling the left-hand side into the Broker…).
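> 
> As a taste of (1), connecting through Avatica's remote JDBC driver looks roughly like this. Treat it as a sketch: the host, port, and the "wikipedia" datasource and its countryName column are placeholders.
> 
>   import java.sql.Connection;
>   import java.sql.DriverManager;
>   import java.sql.ResultSet;
>   import java.sql.Statement;
> 
>   public class DruidSqlDemo {
>     public static void main(String[] args) throws Exception {
>       // Avatica's remote driver speaks to the handler mounted at
>       // /druid/v2/sql/ on the Broker (placeholder host and port).
>       String url =
>           "jdbc:avatica:remote:url=http://broker.example.com:8082/druid/v2/sql/";
>       try (Connection conn = DriverManager.getConnection(url);
>            Statement stmt = conn.createStatement();
>            // Daily row counts plus a filtered aggregation, exercising the
>            // FLOOR(__time TO gran) and FILTER(WHERE ...) items above.
>            ResultSet rs = stmt.executeQuery(
>                "SELECT FLOOR(__time TO DAY) AS d, "
>                + "COUNT(*) FILTER (WHERE countryName = 'US') AS us_rows "
>                + "FROM wikipedia GROUP BY FLOOR(__time TO DAY)")) {
>         while (rs.next()) {
>           System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
>         }
>       }
>     }
>   }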
> 
> Non-exhaustive list of things that don't work:
> 
> 1) Pushing down filter after aggregate (HAVING)
> 2) Push down of anything without a native Druid analog, like multi-column extraction fns, aggregation of expressions, window functions, etc.
> 3) Any extraction fns other than SUBSTRING, CHAR_LENGTH
> 4) A lot of time stuff, like x + INTERVAL, FLOOR(__time TO MONTH) = x, etc.
> 5) Query time lookups
> 6) Select with pagination – only the first 1000 results are used
> 7) Any sort of memory usage controls on the Broker side
> 
> FAQ
> 
> 1) Why another SQL on Druid thing? There's already, like, 7 of them.
> 
> I think the fact that there are 7 of them means there's clearly some value in having a built-in implementation. Partially this is so we can hopefully share some work between the projects. Partially this is because Druid doesn't support some things that are needed for well-rounded SQL support (like multi-column extraction fns, aggregations of expressions, etc.) and having the SQL layer inside the Druid repo will make it possible to develop those sorts of features hand in hand with the SQL planner rules.
> 
> Btw, the 7 that I counted are, in alphabetical order, Calcite (https://calcite.apache.org/docs/druid_adapter.html), Drill (https://groups.google.com/d/msg/druid-development/FK5D162ao74/EnYDjASWCQAJ), Druid's own simple grammar (added in 2013, removed in https://github.com/druid-io/druid/pull/2090), Hive (https://cwiki.apache.org/confluence/display/Hive/Druid+Integration), PlyQL (http://plywood.imply.io/plyql), Sparkline (https://github.com/SparklineData/spark-druid-olap), and Sql4D (https://github.com/srikalyc/Sql4D).
> 
> 2) Is the proposed SQL language actually SQL or is it "SQL-like"?
> 
> In terms of what can be efficiently pushed down to Druid queries, it's "SQL-like". A lot of common SQL features aren't supported – although I think it makes sense to add more over time. Technically Calcite does speak full SQL, but a lot of it at the start would get planned as pulling all the raw data into the Broker and processing it in Calcite's interpreter.
> 
> 3) Why not use the Druid adapter in Calcite?
> 
> Calcite's Druid adapter doesn't depend on any Druid jars; it implements the query language and protocol using its own set of model and client classes. For a built-in approach I wanted to be able to use Druid's own Query, DimFilter, ExtractionFn, etc. in a type-safe way, and wanted to use the query code that already exists in Druid for discovering and querying data nodes. I think this will also help speed up development of Druid features that allow more SQL to be pushed down.
> 
> 4) Can we share work between a built-in Druid SQL and the other SQL on Druid adapters that people are working on?
> 
> Hopefully! I think it would make sense if a built-in Druid SQL could be used for whatever Druid supports natively, and external SQL on Druid adapters could be used when users want to do something that Druid doesn't support. Sharing the work needed to translate "whatever Druid supports natively" into Druid queries would help everyone.
> 
> Hive and Drill already use Calcite internally, and I hope it's workable to stuff Druid's own rules into their planners without changing too much. If those projects are comfortable embedding druid-server then that should work straight away. If they aren't comfortable embedding druid-server (perhaps understandably) then we could bite the bullet and work on a light(er)-weight druid-client jar that has just enough to give us the benefit of type checking, and does not include all the heavy Druid functionality.
> 
> If you're working on one of those projects, feedback is greatly appreciated.
> 
> 5) What happens when parts of the SQL query can't be converted to a native Druid query?
> 
> Calcite is rad and runs the parts that can't be pushed down through an interpreter on the Druid Broker. Of course this means that if you use constructs that are close to the data and can't be pushed down, like grouping on CONCAT(foo, bar) or aggregating SUM(3 * bar), a surprisingly large amount of data can end up being pulled into the Broker. This is not great behavior and something should be done about that…
> 
> 6) What about JOINs?
> 
> I don't know; maybe it makes sense for Druid to have query types usable for joins in the future. But it doesn't now; the closest thing is query-time lookups, which act like a broadcast join. Without native join support in Druid, it makes more sense to pull data out of Druid into another system (like Drill or Hive or Spark) and do the join there. Even if Druid did support native joins, there would still be some value in using an external execution engine to join Druid data with data from some other system. Filters and aggregations can still potentially be pushed down, depending on the query.
> 
> 7) JDBC works, but what about ODBC?
> 
> Avatica's home page says work on an ODBC client has not yet started. The page at https://hortonworks.com/hadoop-tutorial/bi-apache-phoenix-odbc/ is interesting, since the Phoenix Query Server is also Avatica-based, so maybe that work could be useful? However, it doesn't seem to be open source, and when I tried to get the binary to work, the Windows ODBC setup tool crashed after calling getTableTypes. Maybe someone at Hortonworks can comment :)
> 
> Gian
> 

