drill-user mailing list archives

From Paul Rogers <par0...@yahoo.com.INVALID>
Subject Re: Help with time taken to generate parquet file.
Date Thu, 12 Mar 2020 06:25:42 GMT
Hi Vishwajeet,

Welcome to the Drill community. As it turns out, our mailing list does not forward images.
But, fortunately, the moderation software preserved the images so I was able to find them.
Let me tackle your questions one by one.

Like all generalizations, saying "Drill needs lots of memory" is relative. The statement applies
to a production system, running against large files, with many concurrent users. It probably
does not apply to your local machine running a few sample queries.

What drives memory usage? It is not just file size. It is the buffered size. If you scan 1TB
of data with a simple query with only a WHERE clause, Drill will use very little memory. But,
if you sort the 1TB of data, Drill will obviously need lots of memory to perform the sort.
For sort (and several other operations), if there is not enough memory, Drill will spill to
disk, which is slow. (At least three IOs for each block of data instead of just one.)
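To put rough numbers on that "three IOs instead of one" (a back-of-envelope sketch; the 100 MB/s disk throughput is an assumption, not a measurement):

```python
# Back-of-envelope I/O cost of sorting 1 TB in memory vs. spilling to disk.
# DISK_MBPS is an assumed sequential throughput; real hardware varies widely.
DATA_GB = 1024
DISK_MBPS = 100

def io_hours(gb, passes):
    """Hours to move `gb` gigabytes over the disk `passes` times."""
    return gb * 1024 * passes / DISK_MBPS / 3600

in_memory = io_hours(DATA_GB, 1)  # read the input once
spilling = io_hours(DATA_GB, 3)   # read input, write spill, re-read spill

print(f"in-memory sort, I/O only: {in_memory:.1f} h")
print(f"spilling sort,  I/O only: {spilling:.1f} h")
```

So spilling triples the I/O bill before the sort itself has done any work.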

Second, the variable you used to set memory (JAVA_TOOL_OPTIONS, per your message; your first
image did not survive the mailing list) is not the documented way to set memory. See [1] for
the preferred approach. Your approach appears to work, but probably only because you are running
an embedded-mode Drillbit.
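For reference, the route described in [1] is to edit conf/drill-env.sh rather than set a JVM-wide environment variable. The values below are only illustrative; size them for your machine:

```shell
# conf/drill-env.sh -- documented Drill memory knobs (see [1]).
# Example values only; tune for your workload and available RAM.
export DRILL_HEAP="8G"                 # JVM heap for the Drillbit
export DRILL_MAX_DIRECT_MEMORY="10G"   # off-heap memory, where query data lives
```

Note that Drill keeps query data in direct (off-heap) memory, so a huge heap alone does not help query processing much.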

Just to emphasize this: Drill works fine as an embedded desktop tool. But, it is designed
to run well on clusters, with distributed storage and multiple machines all working away on
large queries.

To assign memory, consider your use case. Your second image is a screenshot of one line of
the Drill web console showing the Drillbit using 0.2GB of 8GB of heap, 0GB of direct memory,
and essentially 0% CPU. You did not say whether this is during a query or between queries;
I assume it is between queries.

You mention that you want to "reduce file generation time", but you did not state the kind
of file you are reading, or the expected sizes of the input and output files. (The message
title does state the output is Parquet.) I'll guess that both files reside on your local machine.
So, depending on disk type (SSD or HDD), you can expect maybe 50 MB/s (HDD) to 200 MB/s (SSD)
of I/O throughput. If you want to process a 1GB file, you will need to do 2GB of I/O. At 100 MB/s,
it will take 20 seconds just for the I/O, maybe more if the HDD starts seeking between input
and output files. This is why a production Drillbit runs on multiple servers: to spread out
the I/O.
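The estimate above can be sketched as follows (the throughput figures are the same assumed round numbers, not measurements of your disk):

```python
# Rough I/O time to read a file once and write its converted copy once.
# Throughputs are ballpark figures for a single local disk.
def io_seconds(file_gb, disk_mbps, passes=2):
    """passes=2: one read of the input plus one write of the output."""
    return file_gb * 1024 * passes / disk_mbps

for disk, mbps in [("HDD", 50), ("typical", 100), ("SSD", 200)]:
    print(f"{disk:>7}: ~{io_seconds(1, mbps):.0f} s of I/O for a 1 GB file")
```

That is a floor on run time from the disk alone, before any query processing; seeks between the input and output files on an HDD only make it worse.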

Another issue might be that your input is all one big file. In this case, Drill will run in
a single thread, with no parallelism. Drill works better if your input is divided into multiple
files. (Or, multiple blocks in HDFS or S3.) On the local system, create a directory that contains
your file split into four, eight or more chunks. That way, Drill can put all your CPUs to
work for CPU-intensive tasks such as filtering, computing values, and so on.
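As a sketch of that setup (this assumes a line-oriented input such as CSV with no header row; the file names and chunk count are arbitrary):

```python
# Split a large line-oriented file into N roughly equal chunks inside a
# directory, so Drill can assign one scan fragment per chunk.
import os

def split_file(src, out_dir, chunks=8):
    os.makedirs(out_dir, exist_ok=True)
    outs = [open(os.path.join(out_dir, f"part_{i}.csv"), "w")
            for i in range(chunks)]
    try:
        with open(src) as f:
            for n, line in enumerate(f):
                outs[n % chunks].write(line)  # round-robin lines across chunks
    finally:
        for o in outs:
            o.close()
```

You then point your query at the directory itself (e.g. FROM dfs.`/path/to/out_dir`), and Drill can scan the chunks in parallel.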

At times like this, the query profile is your friend. The amount of information can be overwhelming.
Look at the total run time. Then, look at the time in the various operators. Which ones take
time? Only the scan and root (the root writes your output file)? Or, do you have a join, sort,
or other complex operation? How much parallelism are you getting? You would prefer to keep
all your CPUs busy.

These are a few hints to help you get started. Please feel free to report back your findings
and perhaps give us a bit more of a description of what you are trying to accomplish.

- Paul

[1] https://drill.apache.org/docs/configuring-drill-memory/


    On Wednesday, March 11, 2020, 5:42:39 AM PDT, Vishwajeet Anantvilas SONUNE <vishwajeet.anantvilas.sonune@hsbc.co.in>
wrote:
Hi Team,
I learned the following about Apache Drill: "Drill is memory intensive and therefore requires
sufficient memory to run optimally. You can modify how much memory that you want allocated to
Drill. Drill typically performs better with as much memory as possible."
With this in mind, I tried allocating as much memory as I could for Drill. I'm running Drill
on my local machine, so I configured JAVA_TOOL_OPTIONS to 8GB as an environment variable,
which in turn increased the heap memory.

While running a query to generate a Parquet file from a SQL Server table with millions of records,
Drill uses just 3-4% of heap memory. In any case, there is no improvement in performance (no
reduction in file generation time).
Can you please let us know if there is a way to reduce the file generation time?

Please let me know if any further details are required.
Looking forward to your reply.
Vishwajeet Sonune


