drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Akshay Bhasin (BLOOMBERG/ 731 LEX)" <abhasi...@bloomberg.net>
Subject Re: Feature/Question
Date Tue, 01 Jun 2021 13:48:03 GMT
"I believe that any of the 5 drill machines can handle queries completely symmetrically. When
a query is received, the planning is done and execution fragments are scheduled on the other
nodes."

Thats interesting - I've not come across this. I currently have a lot of history of data partitioned
in s3, and I've tried different queries - however the load has NOT been parallelized/distributed.

Therefore - I had an understanding distributed mode refers to multiple queries being able
to run on a single node - but not the other way around.

I was running the queries from drill-localhost instance from one of the nodes. Is there something
else I should try to achieve the above behavior ?

From: ted.dunning@gmail.com At: 05/30/21 18:51:03 UTC-4:00To:  Akshay Bhasin (BLOOMBERG/ 731
LEX ) 
Cc:  dev@drill.apache.org
Subject: Re: Feature/Question


Here are some more answers:


On Thu, May 27, 2021 at 10:09 AM Akshay Bhasin (BLOOMBERG/ 731 LEX) <abhasin16@bloomberg.net>
wrote:

Hi Ted, 

Yes sure - below are the 2 reasons for it - 

1) If I run 5 drill machines in a cluster, all connected to a single end point at s3, I'll
have to use the machines to create the parquet files. Now, there are 2 sub questions here
- 

         - I'm not sure if a single drill end point is exposed for me to query  ... a unique
cluster ID I can use where all requests will be load balanced ?

I believe that any of the 5 drill machines can handle queries completely symmetrically. When
a query is received, the planning is done and execution fragments are scheduled on the other
nodes.

As such, you can either build a load balancer in front of the cluster or you can do roughly
the same thing using DNS round-robin. It won't make a lot of difference, in practice, though
because the load is spread around pretty well even if only one node does all of the planning
(at least if your queries involve a lot of work).


         - What if the node goes down ? For instance, on a single node (say A in above example)
- one user is running a read query & at the same time I run a create table query ? That
would block and congest the node. 

If a node goes down, any query involving it will fail. The loss will be detected in a few
seconds and any queries accepted by the cluster during that time may hang up a little bit.
Once the failure has been detected, operation will continue without any problems. Clients
may or may not retry their queries automatically (I think that most won't).
 

2) This is a minor one - and I could be wrong - I'm not sure drill can write to s3 bucket.
I think you can only put/upload files there, you cannot write to it. 

Charles' answer was on the mark here.

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message