drill-user mailing list archives

From Paul Rogers <par0...@gmail.com>
Subject Re: What is the most memory-efficient technique for selecting several million records from a CSV file
Date Mon, 26 Oct 2020 18:00:26 GMT
Hi Gareth,

Glad you found a solution! The fix for the REST API is to convert it from
buffered to streaming output. As noted, the current version buffers all rows
in memory and then formats the JSON response. The fix is to stream the JSON
out as the REST API receives each batch from the rest of Drill, which avoids
the memory pressure problem.

As you note, it is a partial fix. The REST API appends the schema to the
JSON message after all the data. If your code needs that schema, it must
buffer the full result set while waiting for it. We decided not to change
this behavior to avoid breaking existing code. We could, however, create a
new API version that sends the schema before the data.
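
For illustration, here is roughly what the current buffered usage looks like
from a Python client; a minimal sketch assuming a drillbit on localhost:8047,
a placeholder dfs path, and the standard /query.json endpoint (the "rows"
field name is illustrative of the response layout):

    import requests

    # Drill's REST query endpoint takes a SQL statement and answers with a
    # single JSON document containing the entire result set.
    payload = {
        "queryType": "SQL",
        "query": "SELECT * FROM dfs.`/data/raw/big_file.csv`",  # placeholder path
    }
    resp = requests.post("http://localhost:8047/query.json", json=payload)
    resp.raise_for_status()

    # The whole result set (and the schema that trails it) is materialized
    # here in one call -- fine for small results, not for 12 million rows.
    result = resp.json()
    print(len(result["rows"]), "rows buffered in memory")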

Sending the schema first has the problem I outlined in my response to
Dobes: Drill can change the schema as the query runs, so the first schema
may not contain all the columns that appear later in the results. (JDBC/ODBC
have this same issue, by the way.)

With Drill, there is no free lunch. It's all a matter of trade-offs.

Thanks,

- Paul

On Mon, Oct 26, 2020 at 2:10 AM Gareth Western <gareth@garethwestern.com>
wrote:

> Hi Paul,
>
> What is the "partial fix" related to in the REST API? The API has worked
> fine for our needs, except in the case I mentioned where we would like to
> select 12 million records all at once. I don't think this type of query
> will ever work with the REST API until the API supports a streaming
> protocol (e.g. gRPC or rsocket), right?
>
> Regarding the cleaning, I found out that there is actually a small
> cleaning step when the CSV is first created, so it should be possible to
> use this stage to convert the data to a format such as Parquet.
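>
> For example, that cleaning stage could run a Drill CTAS to write the data
> out as Parquet with proper types. A rough sketch over the REST API (the
> workspace, file path, and column names/types below are placeholders, not
> our real schema):
>
>     import requests
>
>     # CREATE TABLE AS (CTAS) writes the query result (Parquet by default)
>     # into a writable workspace such as dfs.tmp, casting each column to a
>     # proper type along the way. columns[n] indexing assumes a headerless
>     # CSV; with extractHeader enabled, use the column names instead.
>     ctas = """
>     CREATE TABLE dfs.tmp.`clean_data` AS
>     SELECT CAST(columns[0] AS INT)     AS id,
>            CAST(columns[1] AS VARCHAR) AS name,
>            CAST(columns[2] AS DOUBLE)  AS amount
>     FROM dfs.`/data/raw/big_file.csv`
>     """
>     resp = requests.post("http://localhost:8047/query.json",
>                          json={"queryType": "SQL", "query": ctas})
>     resp.raise_for_status()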
>
> Regarding the immediate solution for my problem, I got the JDBC driver
> working with Python using the JayDeBeApi library, and can keep the memory
> usage down by using the fetchmany method to stream batches of results from
> the server:
> https://gist.github.com/gdubya/a2489e4b9451720bb2be996725ce35bb
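>
> The core of it looks roughly like this (the driver jar path, drillbit
> address, query, and batch size are placeholders for our setup):
>
>     import jaydebeapi
>
>     # Connect through the Drill JDBC driver; the jar path and drillbit
>     # address are placeholders for our environment.
>     conn = jaydebeapi.connect(
>         "org.apache.drill.jdbc.Driver",
>         "jdbc:drill:drillbit=localhost:31010",
>         jars="/opt/drill/jars/jdbc-driver/drill-jdbc-all.jar",
>     )
>     try:
>         cur = conn.cursor()
>         cur.execute("SELECT * FROM dfs.`/data/raw/big_file.csv`")
>         total = 0
>         # fetchmany() pulls one batch at a time, so memory stays bounded
>         # even for millions of rows.
>         while True:
>             batch = cur.fetchmany(10000)
>             if not batch:
>                 break
>             total += len(batch)  # replace with real per-batch processing
>         print(total, "rows streamed from Drill")
>     finally:
>         conn.close()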
>
> Mvh,
> Gareth
>
> On 23/10/2020, 22:44, "Paul Rogers" <par0328@gmail.com> wrote:
>
>     Hi Gareth,
>
>     The REST API is handy. We do have a partial fix queued up, but it got
>     stalled a bit because of a lack of reviewers for the tricky part of the
>     code that is impacted. If the REST API will help your use case, perhaps
>     you can help by reviewing the fix, or by trying it out in your
>     environment. You'd need a custom Drill build, but doing that is pretty
>     easy.
>
>     One other thing to keep in mind: Drill will read many kinds of "raw"
>     data. But Drill does require that the data be clean. For CSV, that means
>     consistent columns and consistent formatting. (A column cannot be a
>     number in one place and a string in another. If using headers, a column
>     cannot be called "foo" in one file and "bar" in another.) If your files
>     are messy, it is very helpful to run an ETL step to clean up the data so
>     you don't end up with random failed queries. Since the data is rewritten
>     for cleaning, you might as well write the output to Parquet as Nitin
>     suggests.
>
>     - Paul
>
>
>
>     On Fri, Oct 23, 2020 at 2:54 AM Gareth Western <gareth@garethwestern.com>
>     wrote:
>
>     > Thanks Paul and Nitin.
>     >
>     > Yes, we are currently using the REST API, so I guess that caveat is
>     > the main issue. I am experimenting with JDBC and ODBC, but haven't
>     > made a successful connection with those from our Python apps yet
>     > (issues not related to Drill but with the libraries I'm trying to use).
>     >
>     > Our use case for Drill is to expose some source data files directly
>     > with the least amount of "preparation" possible (e.g. converting to
>     > Parquet before working with the data). Read performance isn't a
>     > priority yet, as long as we can actually get to the data.
>     >
>     > I guess I'll port the app over to Java and try again with JDBC first.
>     >
>     > Kind regards,
>     > Gareth
>     >
>     > On 23/10/2020, 09:08, "Paul Rogers" <par0328@gmail.com> wrote:
>     >
>     >     Hi Gareth,
>     >
>     >     As it turns out, SELECT * by itself should use a fixed amount of
>     >     memory regardless of table size. (With two caveats.) Drill, as with
>     >     most query engines, reads data in batches, then returns each batch
>     >     to the client. So, if you do SELECT * FROM yourfile.csv, the
>     >     execution engine will use only enough memory for one batch of data
>     >     (which is likely to be in the tens of megabytes in size).
>     >
>     >     The first caveat is if you do a "buffering" operation, such as a
>     >     sort. SELECT * FROM yourfile.csv ORDER BY someCol will need to hold
>     >     all data. But, Drill spills to disk to relieve memory pressure.
>     >
>     >     The other caveat is if you use the REST API to fetch data. Drill's
>     >     REST API is not scalable. It buffers all data in memory in an
>     >     extremely inefficient manner. If you use the JDBC, ODBC or native
>     >     APIs, then you won't have this problem. (There is a pending fix we
>     >     can do for a future release.) Are you using the REST API?
>     >
>     >     Note that the above is just as true of Parquet as it is with CSV.
>     >     However, as Nitin notes, Parquet is more efficient to read.
>     >
>     >     Thanks,
>     >
>     >     - Paul
>     >
>     >
>     >     On Thu, Oct 22, 2020 at 11:30 PM Nitin Pawar <nitinpawar432@gmail.com>
>     >     wrote:
>     >
>     >     > Please convert the CSV to Parquet first, and while doing so make
>     >     > sure you cast each column to the correct datatype.
>     >     >
>     >     > Once you have it in Parquet, your queries should be a bit faster.
>     >     >
>     >     > On Fri, Oct 23, 2020, 11:57 AM Gareth Western
>     >     > <gareth@garethwestern.com> wrote:
>     >     >
>     >     > > I have a very large CSV file (nearly 13 million records) stored
>     >     > > in Azure Storage and read via the Azure Storage plugin. The
>     >     > > drillbit configuration has a modest 4GB heap size. Is there an
>     >     > > effective way to select all the records from the file without
>     >     > > running out of resources in Drill?
>     >     > >
>     >     > > SELECT * … is too big
>     >     > >
>     >     > > SELECT * with OFFSET and LIMIT sounds like the right approach,
>     >     > > but OFFSET still requires scanning through the offset records,
>     >     > > and this seems to hit the same memory issues even with small
>     >     > > LIMITs once the offset is large enough.
>     >     > >
>     >     > > Would it help to switch the format to something other than CSV?
>     >     > > Or move it to a different storage mechanism? Or something else?
>     >     > >
>     >     >
>     >
>
