drill-dev mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [drill] cgivre commented on pull request #2092: DRILL-7763: Add Limit Pushdown to File Based Storage Plugins
Date Fri, 03 Jul 2020 14:19:49 GMT

cgivre commented on pull request #2092:
URL: https://github.com/apache/drill/pull/2092#issuecomment-653570450

   Thanks for taking a look.  
   > @cgivre, how it would work for the case when there was created multiple fragments
with their own scan? From the code, it looks like every fragment would read the same number
of rows specified in the limit. Also, will the limit operator be preserved in the plan if
the scan supports limit pushdown?
   Firstly, the format plugin has to explicitly enable the pushdown.  I don't have the best
test infrastructure, so maybe you could help verify this, but I do believe that each fragment
would read the same number of rows in its own scan.  Ideally, I'd like to fix that, but
let's say you have 5 scans, each reading a file with 1000 rows, and you put a limit of 100
on the query.  Without this PR, my observation was that Drill will still read all 5000 rows,
whereas with this PR, it will read only 500 (100 per scan).
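   As a rough sketch of that arithmetic (this is not Drill code; `rows_read` and its
parameters are hypothetical names, and it assumes every scan stops as soon as it has read
`limit` rows):

```python
def rows_read(file_row_counts, limit=None):
    """Total rows read across all scans.

    Assumption: one scan per file, and each scan applies the limit
    independently, so no scan knows what the others have read.
    """
    if limit is None:
        # No pushdown: every scan reads its whole file.
        return sum(file_row_counts)
    # Pushdown: each scan stops after `limit` rows (or at end of file).
    return sum(min(rows, limit) for rows in file_row_counts)


files = [1000] * 5  # 5 scans, 1000 rows each
print(rows_read(files))             # without limit pushdown
print(rows_read(files, limit=100))  # with per-scan limit pushdown
```

   The per-scan limit is what leaves the total at 5 x 100 rather than 100: the scans do not
coordinate with each other.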
   > Metastore also provides capabilities for pushing the limit, but it works slightly
differently - it prunes files and leaves only minimum files number with specific row count.
Would these two features coexist and work correctly?
   I didn't know about this feature in the metastore.  I would like these features to
coexist if possible.  Could you point me to some resources or docs for this so that I can
take a look?  Ideally, I'd like to make it such that we get the minimum number of files from
the metastore AND we apply the row limit as well, so that we are reading the absolute minimum
amount of data.
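   To illustrate what I mean by combining the two, here is a hypothetical sketch.  It is not
how the metastore is implemented; `prune_files` and `rows_read_combined` are made-up names,
and it assumes the per-file row counts are known up front:

```python
def prune_files(file_row_counts, limit):
    """Keep files, in order, until their combined row count covers the limit.

    Assumption: row counts per file are available from metadata.
    """
    kept, total = [], 0
    for rows in file_row_counts:
        if total >= limit:
            break
        kept.append(rows)
        total += rows
    return kept


def rows_read_combined(file_row_counts, limit):
    """Prune files first, then apply the per-scan row limit to what is left."""
    kept = prune_files(file_row_counts, limit)
    return sum(min(rows, limit) for rows in kept)


files = [1000] * 5
print(prune_files(files, 100))         # files that still need to be scanned
print(rows_read_combined(files, 100))  # rows read with both steps applied
```

   Under those assumptions, pruning alone would still read the whole surviving file, and the
row limit alone would still touch every file, so applying both reads the least data.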
   For some background, I was working on a project where I had several GB of PCAP files in
multiple directories.  I found that Drill could query these files fairly rapidly, but it still
seemed to have a lot of overhead in terms of how many files it was actually reading.  Separately,
when I was working on the Splunk plugin (https://github.com/apache/drill/pull/2089), I discovered
that virtually no storage plugins actually seemed to have a limit pushdown.  This was puzzling,
since the rules and logic for this were already in Drill and in the GroupScan.  On
top of that, it's a fairly easy addition.
   Getting back to this PR, I wanted to see if it made a performance difference when querying
some large files on my machine, and the difference was shocking.  Simple queries and queries
with a `WHERE` clause, which used to take seconds, are now virtually instantaneous.
The difference in user experience is dramatic.
   Anyway, I'd appreciate any help you can give with respect to the metastore and incorporating
that into the PR. 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
