drill-dev mailing list archives

From Timothy Farkas <tfar...@mapr.com>
Subject Re: Drill in the distributed compute jungle
Date Mon, 10 Sep 2018 19:37:05 GMT
It's an interesting idea, and I think the main inhibitor is that the
popular big data projects are built as services. Specifically, if you need
distributed coordination, you run a separate ZooKeeper cluster. If you
need a batch compute engine, you run a separate Spark cluster. If you need
a streaming engine, you deploy a separate Flink or Apex pipeline. If you
want to reuse and combine all these services to build a new engine, you
find yourself maintaining several different clusters of machines, which
just isn't practical.

IMO the paradigm needs to shift from services to libraries. If you need
distributed coordination, import the ZooKeeper library and start the
ZooKeeper client, which will run ZooKeeper threads and turn your
application process into a member of the ZooKeeper quorum. If you need
compute, import the compute engine library and start its client, and your
application node also becomes a worker node. When you start a library, it
discovers the other nodes in your application and forms a cohesive
cluster. I think this shift has already begun. Calcite is a library, not a
query planning service. etcd lets you run an etcd instance inside your
application's process with a simple function call. Arrow is also a
library, not a service. And Apache Ignite lets you embed its compute
engine in your application's process (https://ignite.apache.org/).
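A minimal sketch of what that looks like with Ignite's Java API (the class
name and the trivial job are mine, just for illustration, and this assumes
Ignite's default multicast discovery):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;

    public class EmbeddedIgniteDemo {
      public static void main(String[] args) {
        // Starting Ignite in-process turns this JVM into a cluster
        // node; peers on the same network are discovered automatically
        // under the default multicast discovery.
        try (Ignite ignite = Ignition.start()) {
          // Submit a (trivial) job to the cluster from inside the
          // application process -- no separate compute service needed.
          int result = ignite.compute().call(() -> 1 + 1);
          System.out.println("result = " + result);
        }
      }
    }

Every node that runs this code is both part of your application and a
compute worker, so there is no second cluster to operate.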

If we shift to thinking in libraries instead of services, then it becomes
trivial to build new engines, since a new engine would just be a library
that depends on other libraries. And you no longer manage several
services; you manage only the one service that you built.
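For instance, the planning piece of such an engine is already just a
library call away. A minimal, hypothetical sketch with Calcite (the class
name and the query are made up; SqlParser is Calcite's parser entry
point):

    import org.apache.calcite.sql.SqlNode;
    import org.apache.calcite.sql.parser.SqlParser;

    public class LibraryPlannerDemo {
      public static void main(String[] args) throws Exception {
        // Parse a SQL string entirely in-process. Calcite is just a
        // jar on the classpath, not a service you deploy and operate.
        SqlParser parser = SqlParser.create(
            "SELECT name, COUNT(*) FROM emps GROUP BY name");
        SqlNode ast = parser.parseQuery();
        System.out.println(ast.toString());
      }
    }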

From the little I've read about Ray, it seems Ray is also moving in the
library direction.

Tim



On Sun, Sep 9, 2018 at 10:21 PM Paul Rogers <par0328@yahoo.com.invalid>
wrote:

> Hi All,
>
> Been reading up on distributed DB papers of late, including those passed
> along by this group. Got me thinking about Arina's question about where
> Drill might go in the long term.
>
> One thing I've noticed is that there are now quite a few distributed
> compute frameworks, many of which support SQL in some form. A partial
> list would include Drill, Presto, Impala, Hive LLAP, Spark SQL (sort of),
> Dremio, Alibaba MaxCompute, Microsoft's Dryad, SCOPE, and StreamScope,
> Google's Dremel, BigQuery, and F1, the batch version of Flink -- and
> those are just the ones off the top of my head. It seems every big
> Internet shop has created one (Google, Facebook, Alibaba, Microsoft,
> etc.).
>
> There is probably some lesson in here for Drill. Being a distributed
> compute engine seems to have become a commodity at this late stage of
> Big Data. But it is still extremely hard to build a distributed compute
> engine that scales, especially for a small project like Drill.
>
> What unique value does Drill bring compared to the others? Certainly being
> open source. Being in Java helps. Supporting xDBC is handy. Being able to
> scan any type of data is great (but we tell people that when they get
> serious, they should use only Parquet).
>
> As the team thinks about Arina's question about where Drill goes next, one
> wonders if there is some way to share the load?  Rather than every project
> building its own DAG optimizer and execution engine, its own distribution
> framework, its own scanners, its own implementation of data types, and of
> SQL functions, etc., is there a way to combine efforts?
>
> Ray [1] out of UC Berkeley is early days, but it promises to be exactly
> the highly scalable, low-latency engine that Drill tries to be. Calcite is
> the universal SQL parser and optimizer. Arrow wants to be the database
> toolkit, including data format, network protocol, etc. YARN, Mesos,
> Kubernetes and others want to manage the cluster load. Ranger and Sentry
> want to do data security. There are now countless storage systems (HDFS
> (classic, erasure coding, Ozone), S3, ADLS, Ceph, MapR, Alluxio, Druid,
> Kudu) and countless key-value stores. HMS is the metastore we all love
> to hate; it cries out for a newer, more scalable design -- but one
> shared by all engines.
>
> Then, on the compute side, SQL is just one (important) model. Spark and
> old-school MapReduce can handle general data transform problems. Ray
> targets ML. Apex and Flink target streaming. Then there are graph-query
> engines and more. Might these all be seen as different compute models that
> run on top of a common scheduler, data, security, storage, and distribution
> framework? Each has unique needs around shuffle, planning, duration, etc.
> But the underlying mechanisms are really all quite similar.
>
> Might Drill, in some far future form, be the project that combines these
> tools to create a SQL compute engine? Rather than a little project like
> Drill trying to do it all (and Drill is now talking about taking on the
> metastore challenge), perhaps Drill might evolve to get out of the
> scheduling, distribution, data, networking, metadata and scan businesses to
> focus on the surprising complexity of translating SQL to query plans that
> run on a common, shared substrate.
>
> Of course, for that to happen, there would need to actually BE a common
> substrate: probably in the form of a variety of projects that tackle the
> different challenges listed above. The HDFS client, Calcite, Arrow,
> Kubernetes and Ray projects are hints in that direction (though all have
> obvious gaps). Might such a rearranging of technologies be a next act for
> Apache as the existing big data projects live out their lifecycles and the
> world shifts to the cloud for commodity Hadoop?
>
> Think how cool it would be to grab a compute framework off the shelf
> and, say, build a SQL++ engine for complex data. Or to meld ML and SQL
> into some new hybrid. Or to run a query against multiple data sources,
> with standard row/column security and common metadata, without having
> to write all this stuff yourself.
>
> Not sure how we get there, but it is clear that Apache owning and
> maintaining many different versions of the same technology just dilutes our
> very limited volunteer efforts.
>
> Thoughts? Anyone know of open source efforts in this general direction?
>
> Thanks,
> - Paul
>
> [1]
> https://rise.cs.berkeley.edu/projects/ray/
