drill-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Navin Bhawsar <navin.bhaw...@gmail.com>
Subject Re: FW: EXTERNAL: Re: Apache Drill Sizing guide
Date Thu, 16 Apr 2020 16:15:31 GMT
*Hi Paul,*

Thanks for your response.


I have tried to add more details as advised :

*Query Mix and selectivity*

Query mix will be max 30 concurrent users running adhoc reporting queries
via Drill Rest API called from ASP .Net Core(httpclient).

Query mix is combination of below query load running on server

1.       queries (5-10) aggregating data over (1 GB or 1-3M records)

2.       Majority of queries aggregating data 100k records (15-25)

Most of the queries are using simple filter clause and few using group by
on 10-15 columns out of 150 columns in  Parquet File.

*Performance expectation is these queries should be available in seconds
(<= 30 secs)*


*Partitioning* - Data is already partitioned on date and business level
with lower level include parquet files (200-300 MB,100 K records)



*Storage -*

VMDK(VMware Disk) with 1 TB Size

cached reads -  8000 MB/sec

buffered disk reads - 500 MB/sec

Drill queries parquet files on hdfs


Deployment - HDFS on-perm are hosted on Internal Cloud Platform (IaaS)
<https://alm-confluence.systems.uk.hsbc/confluence/pages/viewpage.action?pageId=51218200>
,spinning
new env will be quick.

Thanks,
Navin

>
> *From:* Paul Rogers <par0328@yahoo.com>
> *Sent:* Tuesday, April 14, 2020 12:41 AM
> *To:* user <user@drill.apache.org>
> *Cc:* arun.ns@gmail.com;
> *Subject:* EXTERNAL: Re: Apache Drill Sizing guide
>
>
>
> Hi Navin,
>
>
>
>
>
> Ted is absolutely right. To add a bit of context, here are some of the
> factors we've considered in the past.
>
>
>
>
>
> Queries: A simple filter scan takes the minimum resources: scan the
> tables, throw away most of the data, and deliver the rows that are needed.
> Such a use case is strongly driven by scan time. As Ted suggests,
> partitioning drives down scan cost. If every query hits the full TB of
> data, you will need many machines & disks to get adequate performance.
> Depending on your hardware, if you get 100 MB/s read performance per disk,
> it will take 10,000 seconds (three hours) to read your TB of data on one
> disk. If you have 100 disks, the time drops to 100 seconds. You didn't
> mention your storage technology: these numbers are likely entirely
> different for something like S3.
>
>
>
>
>
> So, you don't want to read the full TB. By using good partitioning
> (typically by date), you might reduce the scan by a factor of 1000. Huge
> win. And this is true whether you use Drill, Spark, Presto or Python to
> read your data.
>
>
>
>
>
> The next question is the selectivity of your queries. In the simple filter
> case, are you returning a few rows or a GB of rows? The more rows, the more
> Drill must grind through the data once it is read. This internal grinding
> requires CPU and benefits from parallelism. The amount you need depends on
> the number of rows processed per query.
>
>
>
>
>
> There is little memory needed for a pure filter query. Drill reads the
> data, tosses most rows, a returns the remainder to the client. Interesting
> queries, however, do more than filtering: they might group, join, sort and
> so on. Each of these operations carries its own cost. Joins are network
> heavy (to shuffle data). Sorts want enough memory to buffer the entire
> result set to avoid slow disk-based sorts.
>
>
>
>
>
> The query profile will provide lots of good information about the row
> count, memory usage and operators in each of your queries so you can
> determine the resources needed for each. When Ted asks you to analyze each
> query, the best way to do that is to look at the query profile and see
> which resources were needed by that query.
>
>
>
>
>
> Then, there are concurrent users. What do you mean by concurrent? 40
> people who might use Drill during the day so that only a few are active at
> the same time? Or, 40 users each watching dashboard that each run 10
> queries, updated each second, which will place a huge load on the system?
> Most humans are intermittent users. Dashboards, when overdone, can kill any
> system.
>
>
>
>
>
> Also, as Ted has said many times, if you run 40 queries a minute, and each
> takes 1 second, then concurrency turns into sequential processing. On the
> other hand, if one query uses all cluster resources for an hour, and you
> run 10 of them per hour, then the workload will fail.
>
>
>
>
>
> Once you determine the actual "concurrent concurrency" level (number of
> queries that run at the same time), work out the mix. Sum the resources for
> those concurrent queries. That tells you the cluster capacity you need
> (plus some safety margin because load is random.) Drill does have features
> to smooth out the load peaks by queuing queries. Not state-of-the-art, but
> can prevent the inevitable overloads that occur at random peak loads when
> there is not sufficient reserve capacity.
>
>
>
>
>
> You didn't mention your deployment model. In classic Hadoop days, with an
> on-prem cluster, you had to work all this out ahead of time so you could
> plan your equipment purchases 3 to 6 months in advance. In the cloud,
> however, especially with K8s, you just resize the cluster based on demand.
> Drill is not quite there yet with our K8s integration, but the team is
> making good progress and we should have a solution soon;
> contributions/feedback would be very helpful.
>
>
>
>
>
> In short, there are many factors, some rather complex. (We all know it
> should be simple, but having done this with many DBs, it just turns out
> that it never is.)
>
>
>
>
>
> We'd be happy to offer pointers if you can offer a few more specifics.
> Also, perhaps we can distill this discussion into a few pages in the Drill
> docs.
>
>
>
>
>
> Thanks,
>
>
>
> - Paul
>
>
>
>
>
> On Monday, April 13, 2020, 7:59:08 AM PDT, Ted Dunning <
> ted.dunning@gmail.com> wrote:
>
>
>
>
>
> Navin,
>
> Your specification of 40 concurrent users and data size are only a bit less
> than half the story. Without the rest of the story, nobody will be able to
> give you even general guidance beyond a useless estimate that it will take
> between roughly 1 and 40 drillbits with with a gob of memory.
>
> To do better than such non-specific "guidance", you need to add some
> additional answers. For example,
>
> What is the query mix?
> How long do these queries run without any question of concurrency?
> Could that query speed be enhanced with better partitioning?
> How are you storing your data?
> What promises are you making to these concurrent users?
>
>
> On Mon, Apr 13, 2020 at 7:21 AM Navin Bhawsar <navin.bhawsar@gmail.com>
> wrote:
>
> > Hi Team ,
> >
> > We are planning to use drill to query hdfs cluster with about a terabyte
> > data in parquet file format .There will be approx. 40 concurrent users
> > using this environment .
> >
> > Can you please advise on below two points considering above workload for
> > optimum performance:
> > 1. Number of Drill bits
> > 2. Memory Config per drill bit
> >
> > Also is there any sizing guide for Apache Drill to refer.
> >
> >
> > Thanks,
> > Navin
> >
> ------------------------------
> ------------------------------
>
>
>
>
>
>
>
>
>
>
>
>
> * ****************************************************************** This
> message originated from the Internet. Its originator may or may not be who
> they claim to be and the information contained in the message and any
> attachments may or may not be accurate.
> *******************************************************************
>
>
>
> ------------------------------
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> * ******************************************************************* This
> e-mail is confidential. It may also be legally privileged. If you are not
> the addressee you may not copy, forward, disclose or use any part of it. If
> you have received this message in error, please delete it and all copies
> from your system and notify the sender immediately by return e-mail.
> Internet communications cannot be guaranteed to be timely, secure, error or
> virus-free. The sender does not accept liability for any errors or
> omissions.
> ******************************************************************* "SAVE
> PAPER - THINK BEFORE YOU PRINT!" *
>
>

-- 
Navin Bhawsar

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message