drill-user mailing list archives

From Nitin Pawar <nitinpawar...@gmail.com>
Subject Re: What is the most memory-efficient technique for selecting several million records from a CSV file
Date Fri, 23 Oct 2020 06:30:14 GMT
Please convert the CSV to Parquet first, and while doing so make sure you
cast each column to the correct datatype.

Once the data is in Parquet, your queries should be a bit faster.
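
A rough sketch of what that conversion could look like as a Drill CTAS
statement. The plugin name `azure`, the file path, the target workspace
`dfs.tmp`, and the column names and types are all assumptions made for
illustration; adjust them to your own setup. It also assumes the CSV has no
header row, so Drill exposes each line through the `columns` array.

```sql
-- Write CTAS output as Parquet (this is Drill's default, set here to be explicit).
ALTER SESSION SET `store.format` = 'parquet';

-- Hypothetical names: `azure` plugin, `data/records.csv`, and the four columns below.
CREATE TABLE dfs.tmp.`records_parquet` AS
SELECT
  CAST(columns[0] AS INT)          AS id,
  CAST(columns[1] AS VARCHAR(100)) AS name,
  CAST(columns[2] AS DOUBLE)       AS amount,
  CAST(columns[3] AS DATE)         AS created
FROM azure.`data/records.csv`;

-- Later queries read the typed Parquet table and can name only the columns they need.
SELECT id, amount
FROM dfs.tmp.`records_parquet`
WHERE created >= DATE '2020-01-01';
```

Because Parquet is columnar, selecting only the columns you need lets Drill
skip the rest, which is not possible with the CSV.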

On Fri, Oct 23, 2020, 11:57 AM Gareth Western <gareth@garethwestern.com> wrote:

> I have a very large CSV file (nearly 13 million records) stored in Azure
> Storage and read via the Azure Storage plugin. The drillbit configuration
> has a modest 4GB heap size. Is there an effective way to select all the
> records from the file without running out of resources in Drill?
> SELECT * … is too big
> SELECT * with OFFSET and LIMIT sounds like the right approach, but OFFSET
> still requires scanning through the offset records, and this seems to hit
> the same memory issues even with small LIMITs once the offset is large
> enough.
> Would it help to switch the format to something other than CSV? Or move it
> to a different storage mechanism? Or something else?
