drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jinfeng Ni <...@apache.org>
Subject Re: Drill in the distributed compute jungle
Date Wed, 12 Sep 2018 02:47:37 GMT
The idea of making library in stead of service sounds to me very
interesting.  If we could abstract Drill's run-time relational operators
into an independent library, people could easily leverage it to build their
own sql query engine, together with some scheduling service, in stead of
starting from scratch to build a new query service. If we move towards
Arrow, the run-time library would mean standard + customized relational
operators, built on top of standardized  in-memory format.

On Mon, Sep 10, 2018 at 7:01 PM Paul Rogers <par0328@yahoo.com.invalid>

> Hi Tim,
> You said it very well: think in terms of libraries, not services. This is
> exactly the right perspective.
> It is important to recognize that Drill, itself, was created by leveraging
> many different libraries. But, at the time, it seems no big data libraries
> existed. In fact, Drill IS a library; the Drillbit is a thin wrapper around
> Drill. Drill runs just as well embedded in a JDBC driver, launched by a
> unit test, etc.
> Suppose Drill shifted to Arrow. Then, Drill could turn all of the work
> invested in scanners into generally available libraries to read Parquet (or
> CSV or HBase or Kafka) into Arrow vectors. What else might Drill do?
> Not sure how we proceed, but this sure feels like the future of big data
> compute engines.
> Thanks,
> - Paul
>     On Monday, September 10, 2018, 12:37:24 PM PDT, Timothy Farkas <
> tfarkas@mapr.com> wrote:
>  It's an interesting idea, and I think the main inhibitor that prevents
> this
> from happening is that the popular big data projects are stuck on services.
> Specifically if you need distributed coordination you run a separate
> zookeeper cluster. If you need a batch compute engine you run a separate
> spark cluster. If you need a streaming engine you deploy a separate Flink
> or Apex pipeline. If you want to reuse and combine all these services to
> make a new engine, you find yourself maintaining several different clusters
> of machines, which just isn't practical.
> IMO the paradigm needs to shift from services to libraries. If you need
> distributed coordination import the zookeeper library and start the
> zookeeper client, which will run zookeeper threads and turn your
> application process into a member of the zookeeper quorum. If you need
> compute import the compute engine library and start the compute engine
> client and your application node will also turn into a worker node. When
> you start a library it will discover the other nodes in your application to
> form a cohesive cluster. I think this shift has already begun. Calcite is a
> library, not a query planning service. Also etcd allows you to run an etcd
> instance in your application's process using a simple function call. Arrow
> is also a library, not a service. And Apache Ignite is a compute engine
> that allows you to run the Ignite compute engine in your application's
> process https://ignite.apache.org/ .
> If we shift to thinking of libraries instead of services, then it becomes
> trivial to build new engines, since new engines would just be a library
> that depends on other libraries. Also you no longer manage several
> services, you only manage the service that you built.
> From the little I read about ray, it seems like ray is also moving in the
> library direction.
> Tim
> On Sun, Sep 9, 2018 at 10:21 PM Paul Rogers <par0328@yahoo.com.invalid>
> wrote:
> > Hi All,
> >
> > Been reading up on distributed DB papers of late, including those passed
> > along by this group. Got me thinking about Arina's question about where
> > Drill might go in the long term.
> >
> > One thing I've noticed is that there are now quite a few distributed
> > compute frameworks, many of which support SQL in some form. A partial
> list
> > would include Drill, Presto, Impala, Hive LLAP, Spark SQL (sort of),
> > Dremio, Alibaba MaxCompute, Microsoft's Dryad, Scope and StreamS,
> Google's
> > Dremel and BigQuery and F1, the batch version of Flink -- and those are
> > just the ones off the top of my head. Seems every big Internet shop has
> > created one (Google, Facebook, Alibaba, Microsoft, etc.)
> >
> > There is probably some lesson in here for Drill. Being a distributed
> > compute engine seems to have become a commodity at this late stage of Big
> > Data. But, it is still extremely hard to build a distributed compute
> engine
> > that scales, especially for a small project like Drill.
> >
> > What unique value does Drill bring compared to the others? Certainly
> being
> > open source. Being in Java helps. Supporting xDBC is handy. Being able to
> > scan any type of data is great (but we tell people that when they get
> > serious, they should use only Parquet).
> >
> > As the team thinks about Arina's question about where Drill goes next,
> one
> > wonders if there is some way to share the load?  Rather than every
> project
> > building its own DAG optimizer and execution engine, its own distribution
> > framework, its own scanners, its own implementation of data types, and of
> > SQL functions, etc., is there a way to combine efforts?
> >
> > Ray [1] out of UC Berkeley is early days, but it promises to be exactly
> > the highly scalable, low-latency engine that Drill tries to be. Calcite
> is
> > the universal SQL parser and optimizer. Arrow wants to be the database
> > toolkit, including data format, network protocol, etc. YARN, Mesos,
> > Kubernetes and others want to manage the cluster load. Ranger and Sentry
> > want to do data security. There are now countless storage formats (HDFS
> > (classic, erasure coding, Ozone), S3, ADLS, Ceph, MapR, Aluxio, Druid,
> Kudu
> > and countless key-value stores. HMS is the metastore we all love to hate
> > and cries out for a newer, more scalable design -- but one shared by all
> > engines.
> >
> > Then, on the compute side, SQL is just one (important) model. Spark and
> > old-school MapReduce can handle general data transform problems. Ray
> > targets ML. Apex and Flink target streaming. Then there are graph-query
> > engines and more. Might these all be seen as different compute models
> that
> > run on top of a common scheduler, data, security, storage, and
> distribution
> > framework? Each has unique needs around shuffle, planning, duration, etc.
> > But the underlying mechanisms are really all quite similar.
> >
> > Might Drill, in some far future form, be the project that combines these
> > tools to create a SQL compute engine? Rather than a little project like
> > Drill trying to do it all (and Drill is now talking about taking on the
> > metastore challenge), perhaps Drill might evolve to get out of the
> > scheduling, distribution, data, networking, metadata and scan businesses
> to
> > focus on the surprising complexity of translating SQL to query plans that
> > run on a common, shared substrate.
> >
> > Of course, for that to happen, there would need to actually BE a common
> > substrate: probably in the form of a variety of projects that tackle the
> > different challenges listed above. The HDFS client, Calcite, Arrow,
> > Kubernetes and Ray projects are hints in that direction (though all have
> > obvious gaps.) Might such a rearranging of technologies be a next act for
> > Apache as the existing big data projects live out their lifecycles and
> the
> > world shifts to the cloud for commodity Hadoop?
> >
> > Think how cool it would be to be able to grab a compute framework off the
> > shelf and, say, build a SQL++ engine for complex data? Or, meld ML and
> > to create some new hybrid? To be able to run a query against multiple
> data
> > sources, with standard row/column security, and common metadata, without
> > the need to write all this stuff yourself?
> >
> > Not sure how we get there, but it is clear that Apache owning and
> > maintaining many different versions of the same technology just dilutes
> our
> > very limited volunteer efforts.
> >
> > Thoughts? Anyone know of open source efforts in this general direction?
> >
> > Thanks,
> > - Paul
> >
> > [1]
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__rise.cs.berkeley.edu_projects_ray_&d=DwIFaQ&c=cskdkSMqhcnjZxdQVpwTXg&r=4eQVr8zB8ZBff-yxTimdOQ&m=XoNNIKuw9d-k4-4DMVxRL4oRAnyP3YR8N5iSSY0X1Io&s=T33zCvWoxmkbTmerrTV4ZNujrFqv9vWJu6zjXpbgnIA&e=
> >
> >
> >
> >
> >

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message