drill-user mailing list archives

From Gareth Western <gar...@garethwestern.com>
Subject What is the most memory-efficient technique for selecting several million records from a CSV file
Date Fri, 23 Oct 2020 06:27:19 GMT
I have a very large CSV file (nearly 13 million records) stored in Azure Storage and read via
the Azure Storage plugin. The drillbit configuration has a modest 4GB heap size. Is there
an effective way to select all the records from the file without running out of resources
in Drill?

A plain SELECT * … is too big and runs out of resources.

SELECT * with OFFSET and LIMIT sounds like the right approach, but OFFSET still requires scanning
through the offset records, and this seems to hit the same memory issues even with small LIMITs
once the offset is large enough.
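To put rough numbers on the OFFSET concern, here is a minimal sketch of the arithmetic. It assumes each paged query has to read offset + limit rows, which is the behaviour described above and not a statement about Drill's internals. The function name and the page size of 1 million are made up for illustration.

```python
def rows_scanned_with_offset_paging(total_rows: int, page_size: int) -> int:
    """Total rows read across all pages, assuming every page query
    scans its skipped (offset) rows plus the rows it returns."""
    scanned = 0
    for offset in range(0, total_rows, page_size):
        # The last page may be shorter than page_size.
        limit = min(page_size, total_rows - offset)
        scanned += offset + limit
    return scanned


total_rows = 13_000_000   # roughly the size of the CSV file
page_size = 1_000_000     # hypothetical LIMIT per query

scanned = rows_scanned_with_offset_paging(total_rows, page_size)
print(scanned)                # 91000000
print(scanned / total_rows)   # 7.0
```

Under that assumption, paging through the file in 13 chunks reads 91 million rows in total, about 7 times the cost of a single pass, and the ratio grows as the page size shrinks.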

Would it help to switch the format to something other than CSV? Or move it to a different
storage mechanism? Or something else?
