drill-dev mailing list archives

From Paul Rogers <par0...@yahoo.com.INVALID>
Subject Drill in the distributed compute jungle
Date Mon, 10 Sep 2018 05:21:08 GMT
Hi All,

Been reading up on distributed DB papers of late, including those passed along by this group.
Got me thinking about Arina's question about where Drill might go in the long term.

One thing I've noticed is that there are now quite a few distributed compute frameworks, many
of which support SQL in some form. A partial list would include Drill, Presto, Impala, Hive
LLAP, Spark SQL (sort of), Dremio, Alibaba MaxCompute, Microsoft's Dryad, SCOPE and StreamScope,
Google's Dremel, BigQuery and F1, the batch version of Flink -- and those are just the
ones off the top of my head. It seems every big Internet shop has created one (Google, Facebook,
Alibaba, Microsoft, etc.).

There is probably a lesson in here for Drill. Being a distributed compute engine seems
to have become a commodity at this late stage of Big Data. But it is still extremely hard
to build a distributed compute engine that scales, especially for a small project like Drill.

What unique value does Drill bring compared to the others? Certainly being open source. Being
in Java helps. Supporting xDBC is handy. Being able to scan any type of data is great (but
we tell people that when they get serious, they should use only Parquet).

As the team thinks about Arina's question of where Drill goes next, one wonders: is there
some way to share the load? Rather than every project building its own DAG optimizer and
execution engine, its own distribution framework, its own scanners, and its own implementations
of data types and SQL functions, is there a way to combine efforts?

Ray [1] out of UC Berkeley is in its early days, but it promises to be exactly the highly scalable,
low-latency engine that Drill tries to be. Calcite is the universal SQL parser and optimizer.
Arrow wants to be the database toolkit, including data format, network protocol, etc. YARN,
Mesos, Kubernetes and others want to manage the cluster load. Ranger and Sentry want to do
data security. There are now countless storage systems: HDFS (classic, erasure coding, Ozone),
S3, ADLS, Ceph, MapR, Alluxio, Druid, Kudu, and countless key-value stores. HMS is the metastore
we all love to hate; it cries out for a newer, more scalable design -- but one shared by all
engines.
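
(As a concrete illustration of how reusable some of these pieces already are, here is a
minimal sketch of Calcite's Frameworks/Planner API doing the parse-validate-plan steps an
engine would otherwise have to build itself. The query is trivial and the schema empty; a
real engine would register its own tables and planner rules.)

    import org.apache.calcite.plan.RelOptUtil;
    import org.apache.calcite.rel.RelNode;
    import org.apache.calcite.schema.SchemaPlus;
    import org.apache.calcite.sql.SqlNode;
    import org.apache.calcite.sql.parser.SqlParser;
    import org.apache.calcite.tools.FrameworkConfig;
    import org.apache.calcite.tools.Frameworks;
    import org.apache.calcite.tools.Planner;

    public class CalciteSketch {
      public static void main(String[] args) throws Exception {
        // Empty root schema; a real engine would register its tables here.
        SchemaPlus rootSchema = Frameworks.createRootSchema(true);
        FrameworkConfig config = Frameworks.newConfigBuilder()
            .parserConfig(SqlParser.Config.DEFAULT)
            .defaultSchema(rootSchema)
            .build();
        Planner planner = Frameworks.getPlanner(config);

        // Parse -> validate -> convert to a relational-algebra plan,
        // all without writing a parser or validator of our own.
        SqlNode parsed = planner.parse("SELECT 1 + 1 AS two");
        SqlNode validated = planner.validate(parsed);
        RelNode plan = planner.rel(validated).project();
        System.out.println(RelOptUtil.toString(plan));
      }
    }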

Then, on the compute side, SQL is just one (important) model. Spark and old-school MapReduce
can handle general data-transform problems. Ray targets ML. Apex and Flink target streaming.
Then there are graph-query engines and more. Might these all be seen as different compute
models that run on top of a common scheduling, data, security, storage, and distribution
framework? Each has unique needs around shuffle, planning, duration, and so on, but the
underlying mechanisms are really all quite similar.
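
(To make that concrete, here is a purely hypothetical sketch -- every name below is invented
for illustration; none of these interfaces exist in any project -- of what the contract
between a shared substrate and pluggable compute models might look like:)

    // Hypothetical only: invented names, no real project defines these.
    interface Task {}              // one unit of work, however a model defines it
    interface TaskHandle {}        // identity of a scheduled task on the cluster
    interface Channel {}           // a shuffle/exchange edge between two tasks
    interface Catalog {}           // shared metadata (the HMS successor)
    interface AccessControl {}     // row/column security (the Ranger/Sentry role)

    /** Services every engine needs but none should have to rebuild. */
    interface Substrate {
      TaskHandle schedule(Task task);                   // the YARN/Mesos/K8s/Ray role
      Channel shuffle(TaskHandle from, TaskHandle to);  // the network/exchange layer
      Catalog metadata();
      AccessControl security();
    }

    /** A compute model -- SQL, streaming, ML, graph -- plugs in on top. */
    interface ComputeModel<Q> {
      /** Translate a query in this model into tasks on the shared substrate. */
      Iterable<Task> plan(Q query, Substrate substrate);
    }

Under that split, Drill's job shrinks to the genuinely hard part: a SQL compute model that
turns Calcite plans into substrate tasks, while streaming and ML models would differ only in
how they drive scheduling and shuffle.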

Might Drill, in some far-future form, be the project that combines these tools to create a
SQL compute engine? Rather than a little project like Drill trying to do it all (and Drill
is now talking about taking on the metastore challenge), perhaps Drill might evolve to get
out of the scheduling, distribution, data, networking, metadata and scan businesses to focus
on the surprising complexity of translating SQL to query plans that run on a common, shared
substrate.

Of course, for that to happen, there would need to actually BE a common substrate: probably
in the form of a variety of projects that tackle the different challenges listed above. The
HDFS client, Calcite, Arrow, Kubernetes and Ray projects are hints in that direction (though
all have obvious gaps). Might such a rearranging of technologies be a next act for Apache
as the existing big data projects live out their lifecycles and the world shifts to the cloud
for commodity Hadoop?

Think how cool it would be to grab a compute framework off the shelf and, say, build a SQL++
engine for complex data. Or to meld ML and SQL to create some new hybrid. Or to run a query
against multiple data sources, with standard row/column security and common metadata, without
the need to write all this stuff yourself.

Not sure how we get there, but it is clear that Apache owning and maintaining many different
versions of the same technology just dilutes our very limited volunteer efforts.

Thoughts? Anyone know of open source efforts in this general direction?

Thanks,
- Paul

[1] https://rise.cs.berkeley.edu/projects/ray/




