drill-dev mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5282) Rationalize record batch sizes in all readers and operators
Date Tue, 21 Feb 2017 16:57:44 GMT
Paul Rogers created DRILL-5282:

             Summary: Rationalize record batch sizes in all readers and operators
                 Key: DRILL-5282
                 URL: https://issues.apache.org/jira/browse/DRILL-5282
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.10.0
            Reporter: Paul Rogers

Drill uses record batches to process data. A record batch consists of a "bundle" of vectors
that, combined, hold the data for some number of records.

The key consideration for a record batch is the memory it consumes. Operators and readers
differ widely in how large they make a batch. The text reader can produce batches of hundreds
of KB, while the flatten operator produces batches of half a GB. Other operators fall somewhere
in between. Some readers place no limit on batch size at all; their batch size is driven by
average row width.

Another key consideration is record count. Batches have a hard physical limit of 64K records
(the number of rows that a two-byte selection vector can index). Some operators produce this
many, others far fewer. In one case, we saw a reader that produced 64K+1 records.
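
To make the arithmetic behind the 64K limit concrete, here is a minimal sketch. The class and
method names are hypothetical and are not part of Drill; the only fact used is that a two-byte
(16-bit) index can address 2^16 = 65,536 positions, so a batch of 64K+1 records has one row
that such an index cannot reach.

```java
final class SelectionVectorLimit {
    // A two-byte (16-bit) index addresses positions 0..65,535, i.e. 2^16 rows.
    static final int MAX_BATCH_ROWS = 1 << 16;

    /** Returns true if every row of a batch with rowCount records has a valid two-byte index. */
    static boolean fitsTwoByteIndex(int rowCount) {
        return rowCount >= 0 && rowCount <= MAX_BATCH_ROWS;
    }

    public static void main(String[] args) {
        System.out.println(fitsTwoByteIndex(MAX_BATCH_ROWS));     // true
        System.out.println(fitsTwoByteIndex(MAX_BATCH_ROWS + 1)); // false: the 64K+1 case
    }
}
```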

A final consideration is the size of individual vectors. Drill incurs severe memory fragmentation
when vectors grow above 16 MB.

In some cases, operators (such as the Parquet reader) allocate large batches, but only partially
fill them, creating a large amount of wasted space. That space adds up when we must buffer
it during a sort.
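
The cost of partially filled batches can be estimated with simple arithmetic. The sketch below
is illustrative only: the class name, the method names, and the figures in main (32 MB
allocated, 8 MB used, 100 buffered batches) are assumptions chosen for the example, not
measurements from Drill.

```java
final class BatchUtilization {
    /** Fraction of the allocated bytes that actually hold data, in the range [0, 1]. */
    static double utilization(long usedBytes, long allocatedBytes) {
        if (allocatedBytes <= 0 || usedBytes < 0 || usedBytes > allocatedBytes) {
            throw new IllegalArgumentException("expected 0 <= used <= allocated, allocated > 0");
        }
        return (double) usedBytes / allocatedBytes;
    }

    /** Total unused bytes held in memory when a sort buffers many such batches. */
    static long wastedBytes(long usedBytes, long allocatedBytes, int bufferedBatches) {
        utilization(usedBytes, allocatedBytes); // reuse the argument checks
        return (allocatedBytes - usedBytes) * bufferedBatches;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // Hypothetical batch: 32 MB allocated, 8 MB filled, 100 batches buffered by a sort.
        System.out.println(utilization(8 * mb, 32 * mb));            // 0.25
        System.out.println(wastedBytes(8 * mb, 32 * mb, 100) / mb);  // 2400 (MB of unused space)
    }
}
```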

This ticket asks us to research an optimal batch size, create a framework to build batches of
that size, and retrofit all operators that produce batches to use the framework, so that batches
are uniform.
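
One possible shape for such a framework is a small limiter that a reader or operator consults
before adding each row. The sketch below is only an illustration under stated assumptions: the
class BatchSizeLimiter and its methods are hypothetical, it does not use Drill's vector or
allocator classes, and the target batch size is a constructor parameter because the optimal
value is exactly what this ticket proposes to research. It enforces the three limits described
above: the 64K row count, the 16 MB per-vector size, and a total memory target per batch.

```java
final class BatchSizeLimiter {
    static final int MAX_ROWS = 1 << 16;                    // two-byte selection vector limit
    static final long MAX_VECTOR_BYTES = 16L * 1024 * 1024; // 16 MB per-vector limit

    private final long targetBatchBytes; // assumption: tunable until an optimum is known
    private final long[] vectorBytes;    // bytes accumulated so far in each column vector
    private long batchBytes;
    private int rowCount;

    BatchSizeLimiter(int columnCount, long targetBatchBytes) {
        if (columnCount <= 0 || targetBatchBytes <= 0) {
            throw new IllegalArgumentException("columnCount and targetBatchBytes must be positive");
        }
        this.vectorBytes = new long[columnCount];
        this.targetBatchBytes = targetBatchBytes;
    }

    /**
     * Accounts for one row whose i-th column occupies columnWidths[i] bytes.
     * Returns false, leaving the state unchanged, if the row would break a limit;
     * the caller should then send the current batch downstream and call reset().
     */
    boolean tryAddRow(int[] columnWidths) {
        if (columnWidths.length != vectorBytes.length) {
            throw new IllegalArgumentException("one width per column is required");
        }
        if (rowCount >= MAX_ROWS) {
            return false;
        }
        long rowBytes = 0;
        for (int i = 0; i < columnWidths.length; i++) {
            if (vectorBytes[i] + columnWidths[i] > MAX_VECTOR_BYTES) {
                return false;
            }
            rowBytes += columnWidths[i];
        }
        // Always admit the first row so that a single wide row cannot stall the reader.
        if (rowCount > 0 && batchBytes + rowBytes > targetBatchBytes) {
            return false;
        }
        for (int i = 0; i < columnWidths.length; i++) {
            vectorBytes[i] += columnWidths[i];
        }
        batchBytes += rowBytes;
        rowCount++;
        return true;
    }

    int rowCount() { return rowCount; }

    long batchBytes() { return batchBytes; }

    /** Starts a new, empty batch. */
    void reset() {
        java.util.Arrays.fill(vectorBytes, 0L);
        batchBytes = 0;
        rowCount = 0;
    }

    public static void main(String[] args) {
        // Hypothetical workload: two columns, 1 KB per row, 1 MB target per batch.
        BatchSizeLimiter limiter = new BatchSizeLimiter(2, 1024L * 1024);
        int[] widths = {4, 1020};
        while (limiter.tryAddRow(widths)) { /* a real reader would also copy the row here */ }
        System.out.println(limiter.rowCount()); // 1024
    }
}
```

The limiter tracks sizes only, so the same accounting could sit in front of any reader or
operator regardless of how it fills its vectors. Whether the checks belong in a shared helper
like this or inside the vector writers themselves is a design question the ticket leaves open.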
