drill-user mailing list archives

From Paul Rogers <par0...@yahoo.com.INVALID>
Subject Re: Drill on YARN Questions
Date Wed, 02 Jan 2019 02:18:06 GMT
Hi Charles,

Your engineers have identified a common need, but one which is very difficult to satisfy.

TL;DR: DoY gets as close to the requirements as possible within the constraints of YARN and
Drill. But, future projects could do more.

Your engineers want resource segregation among tenants: multi-tenancy. This is very difficult
to achieve at the application level. Consider Drill. It would need some way to identify users
to know which tenant they belong to. Then, Drill would need a way to enqueue users whose queries
would exceed the memory or CPU limit for that tenant. Plus, Drill would have to be able to
limit memory and CPU for each query. Much work has been done to limit memory, but CPU is very
difficult. Mature products such as Teradata can do this, but Teradata has 40 years of effort
behind it.
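
To make the difficulty concrete, here is a rough sketch, in Python, of just the bookkeeping such a feature would need. Every name below is invented; nothing like this exists in Drill today. Note that it covers only memory and concurrency, and says nothing about CPU throttling, which is the truly hard part:

from collections import deque
from dataclasses import dataclass

@dataclass
class Query:
    user: str
    memory_estimate: int   # bytes the planner expects the query to need

class TenantPool:
    """Per-tenant budget: admit a query only if the tenant has headroom."""
    def __init__(self, memory_limit: int, max_concurrent: int):
        self.memory_limit = memory_limit
        self.max_concurrent = max_concurrent
        self.memory_in_use = 0
        self.running = 0
        self.waiting = deque()   # queries held until resources free up

    def try_admit(self, q: Query) -> bool:
        fits = (self.memory_in_use + q.memory_estimate <= self.memory_limit
                and self.running < self.max_concurrent)
        if fits:
            self.memory_in_use += q.memory_estimate
            self.running += 1
            return True          # caller may start the query now
        self.waiting.append(q)   # enqueue until some other query finishes
        return False

    def release(self, q: Query) -> None:
        """Call when a query finishes; frees budget for waiting queries."""
        self.memory_in_use -= q.memory_estimate
        self.running -= 1

# Even the user-to-tenant mapping is an open design question (LDAP? config?).
TENANT_OF_USER = {"alice": "tenantA", "bob": "tenantB"}
POOLS = {"tenantA": TenantPool(64 << 30, 10),
         "tenantB": TenantPool(32 << 30, 5)}

def submit(q: Query) -> bool:
    return POOLS[TENANT_OF_USER[q.user]].try_admit(q)

And this toy version ignores everything that makes the real problem hard: state shared across Drillbits, accurate memory estimates, and CPU scheduling.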

Since it is hard to build multi-tenancy in at the app level (not impossible, just very, very
hard), the thought is to apply it at the cluster level. This is done in YARN via limiting
the resources available to processes (typically map/reduce) and to limit the number of running
processes. Works for M/R because each map task uses disk to shuffle results to a reduce task,
so map and reduce tasks can run asynchronously.
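
For concreteness, that caging is typically expressed in YARN's capacity-scheduler.xml along these lines (queue names, shares, and user lists are made up for illustration; see the Hadoop capacity scheduler docs for the full property set):

<!-- Two tenant queues caged at fixed shares of the cluster. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>tenantA,tenantB</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.tenantA.capacity</name>
  <value>60</value>           <!-- guaranteed share, in percent -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.tenantA.maximum-capacity</name>
  <value>60</value>           <!-- hard cap: no borrowing from tenantB -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.tenantA.acl_submit_applications</name>
  <value>alice,bob</value>    <!-- ACL gating, as your engineers describe -->
</property>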

For tools such as Drill, which do in-memory processing (really, across-the-network exchanges),
both the sender and receiver have to run concurrently. This is much harder to schedule than
async M/R tasks: it means that the entire Drill cluster (of whatever size) must be up and running
to run a query.

The start-up time for Drill is far, far longer than a query. So, it is not feasible to use
YARN to launch a Drill cluster for each query the way you would do with Spark. Instead, under
YARN, Drill is a long running service that handles many queries.
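
In practice, "long-running service" means you manage the cluster with the DoY client rather than per query. From memory (please check the DoY docs for exact usage), the lifecycle looks roughly like:

$DRILL_HOME/bin/drill-on-yarn.sh start    # launch the Drillbits as one YARN app
$DRILL_HOME/bin/drill-on-yarn.sh status   # the app keeps running between queries
$DRILL_HOME/bin/drill-on-yarn.sh stop     # shutdown is explicit, not per-query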

Obviously, this is not ideal: I'm sure your engineers want to use a tenant's resources for
Drill when running queries, and otherwise for Spark, Hive, or maybe TensorFlow. If Drill has to be
long-running, I'm sure they'd like to slosh resources between tenants as is done in YARN.
As noted above, this is a hard problem that DoY did not attempt to solve.

One might suggest that Drill grab resources from YARN when Tenant A wants to run a query,
and release them when that tenant is done, grabbing new resources when Tenant B wants to run.
Impala tried this with Llama and found it did not work. (This is why DoY is quite a bit simpler;
no reason to rerun a failed experiment.)

Some folks are looking to Kubernetes (K8s) as a solution. But, that just replaces YARN with
K8s: Drill is still a long-running process.

To solve the problem you identify, you'll need either:

* A bunch of work in Drill to build multi-tenancy into Drill, or
* A cloud-like solution in which each tenant spins up a Drill cluster within its budget, spinning
it down, or resizing it, to stay within an overall budget.

The second option can be achieved under YARN with DoY, assuming that DoY added support for
graceful shutdown (or that the cluster is reduced in size only when no queries are active). Longer-term,
a more modern solution would be Drill-on-Kubernetes (DoK?), which Abhishek started on.
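
Sketching how that second option could work with today's DoY client (sizes are illustrative, and, per the caveat above, shrinking is only safe while the cluster is idle):

$DRILL_HOME/bin/drill-on-yarn.sh resize +2   # grow ahead of a heavy workload
$DRILL_HOME/bin/drill-on-yarn.sh resize -2   # hand resources back to the tenant's queue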

Engineering is the art of compromise. The question for your engineers is how to achieve the
best result given the limitations of the software available today, while at the same time helping
the Drill community improve the solutions over time.

Thanks,
- Paul

    On Sunday, December 30, 2018, 9:38:04 PM PST, Charles Givre <cgivre@gmail.com> wrote:

Hi Paul,
Here’s what our engineers said:

From Paul’s response, I understand that there is a slight confusion around how multi-tenancy
has been enabled in our data lake.

Some more details on this – 

Drill already has a form of multi-tenancy: we can have multiple Drill clusters running
on the same data lake, separated by different ports and ZooKeeper settings. But all of this is launched
through the same hard-coded YARN queue that we provide as a config parameter.

In our data lake, each tenant has a certain amount of compute capacity allotted to them which
they can use for their project work. This is provisioned through an individual YARN queue for
each tenant (resource caging). This keeps tenants from using cluster resources beyond
a certain limit and from impacting other tenants.

Access to these YARN queues is provisioned through ACL memberships. 

——

Does this make sense?  Is it possible to get Drill to work in this manner, or should we
look into opening JIRAs and working on new capabilities?



> On Dec 17, 2018, at 21:59, Paul Rogers <par0328@yahoo.com.INVALID> wrote:
> 
> Hi Kwizera,
> I hope my answer to Charles gave you the information you need. If not, please check out
the DoY documentation or ask follow-up questions.
> Key thing to remember: Drill is a long-running YARN service; queries DO NOT go through
YARN queues, they go through Drill directly.
> 
> Thanks,
> - Paul
> 
> 
> 
>    On Monday, December 17, 2018, 11:01:04 AM PST, Kwizera hugues Teddy <nbted2017@gmail.com>
wrote:  
> 
> Hello,
> Same questions,
> I would like to know how Drill deals with this YARN functionality.
> Cheers.
> 
> On Mon, Dec 17, 2018, 17:53 Charles Givre <cgivre@gmail.com> wrote:
> 
>> Hello all,
>> We are trying to set up a Drill cluster on our corporate data lake.  Our
>> cluster requires dynamic YARN queue allocation for multi-tenant
>> environment.  Is this something that Drill supports or is there a
>> workaround?
>> Thanks!
>> —C  