drill-dev mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5211) External sort fails to allocate merge memory when plenty is free
Date Sun, 22 Jan 2017 00:21:26 GMT
Paul Rogers created DRILL-5211:

             Summary: External sort fails to allocate merge memory when plenty is free
                 Key: DRILL-5211
                 URL: https://issues.apache.org/jira/browse/DRILL-5211
             Project: Apache Drill
          Issue Type: Bug
            Reporter: Paul Rogers
            Assignee: Paul Rogers
             Fix For: 1.9.0

Consider a test of the external sort as follows:

* Direct memory: 3GB
* Input file: 18 GB, with one Varchar column of 8K width

The sort runs, spilling to disk. Once all data arrives, the sort begins to merge the results.
But, to do that, it must first do an intermediate merge. For example, in this sort, there
are 190 spill files, but only 19 can be merged at a time. (Each merge file contains 128 MB
batches, and only 19 can fit in memory, giving a total footprint of 2.5 GB, well below the
3 GB limit.)
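
The arithmetic behind those numbers can be sketched in a few lines of Java. This is an illustration only: the class and method names are hypothetical (not Drill code), and it assumes that 128 MB means 128 MiB, that 3 GB means 3 GiB, and that the merge holds one batch per spill file in memory.

```java
// Illustrative sketch; names are hypothetical and not taken from Drill.
final class MergePlanSketch {
    static final long MIB = 1024L * 1024L;

    // Bytes held in memory when merging `fanIn` spill files,
    // assuming one batch of `batchBytes` is loaded per file.
    static long mergeFootprint(int fanIn, long batchBytes) {
        return (long) fanIn * batchBytes;
    }

    // Number of runs left after one intermediate pass that merges
    // up to `fanIn` spill files into each new run (ceiling division).
    static int runsAfterPass(int spillFiles, int fanIn) {
        return (spillFiles + fanIn - 1) / fanIn;
    }

    public static void main(String[] args) {
        int spillFiles = 190;
        int fanIn = 19;
        long batchBytes = 128 * MIB;
        long limit = 3L * 1024 * MIB; // assumes "3 GB" means 3 GiB

        long footprint = mergeFootprint(fanIn, batchBytes);
        System.out.println("merge footprint bytes = " + footprint);
        System.out.println("fits under limit      = " + (footprint < limit));
        System.out.println("runs after first pass = " + runsAfterPass(spillFiles, fanIn));
    }
}
```

Under these assumptions the footprint is 2,550,136,832 bytes, and one intermediate pass reduces the 190 spill files to 10 runs, few enough for a single final merge.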

Yet, when loading batch xx, Drill fails with an OOM error. At that point, total available
direct memory is 3,817,865,216. (Obtained from {{maxMemory}} in the {{Bits}} class.)

It appears that Drill wants to allocate 58,257,868 bytes, but the {{totalCapacity}} (again
in {{Bits}}) is already 3,800,769,206, causing an OOM.
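
To see why this particular request fails, the check can be modeled as a simple comparison against the remaining global budget. The sketch below is a simplified model under that assumption, not the JDK's actual reservation code; the class and method names are hypothetical, and only the three numbers come from the run described above.

```java
// Simplified model of a global direct-memory limit check.
// Hypothetical names; this is not the JDK's or Drill's implementation.
final class DirectMemoryCheckSketch {

    // A request succeeds only if it fits in what is left under the limit.
    static boolean canReserve(long maxMemory, long totalCapacity, long requested) {
        return requested <= maxMemory - totalCapacity;
    }

    public static void main(String[] args) {
        long maxMemory = 3_817_865_216L;     // reported maxMemory
        long totalCapacity = 3_800_769_206L; // reported totalCapacity
        long requested = 58_257_868L;        // size of the failing allocation

        System.out.println("remaining  = " + (maxMemory - totalCapacity));
        System.out.println("canReserve = " + canReserve(maxMemory, totalCapacity, requested));
    }
}
```

With these figures only 17,096,010 bytes remain under the limit, so a 58,257,868-byte request cannot be satisfied.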

The problem is that, at this point, the external sort should not ask the system for more memory.
The allocator for the external sort is at just 1,192,350,366 before the allocation request.
Plenty of spare memory should be available, released when the in-memory batches were spilled
to disk prior to merging. Indeed, earlier in the run, the sort had reached a peak memory usage
of 2,710,716,416 bytes. This memory should be available for reuse during merging, and is more
than sufficient to fill the particular request in question.
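
The headroom argument is plain subtraction, shown here as a short Java sketch. The names are hypothetical, and treating "peak minus current" as reusable memory is the assumption made in this report, not something the code verifies.

```java
// Illustrative arithmetic only; names are hypothetical, not Drill's allocator API.
final class SortHeadroomSketch {

    // Memory the sort previously held (at its peak) but no longer holds.
    static long headroom(long peakBytes, long currentBytes) {
        return peakBytes - currentBytes;
    }

    public static void main(String[] args) {
        long peak = 2_710_716_416L;          // sort's earlier peak usage
        long current = 1_192_350_366L;       // sort allocator before the request
        long requested = 58_257_868L;        // failing allocation
        long totalCapacity = 3_800_769_206L; // global direct memory in use

        long free = headroom(peak, current);
        System.out.println("headroom under earlier peak = " + free);
        System.out.println("request fits in headroom    = " + (requested <= free));
        // Direct memory counted globally but not held by the sort's allocator.
        System.out.println("outside sort allocator      = " + (totalCapacity - current));
    }
}
```

The sort is 1,518,366,050 bytes below its own earlier peak, while 2,608,418,840 bytes of the global total are not accounted for by the sort's allocator.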

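It may be a coincidence, but in a different run, the OOM occurs once memory hits 1,310,154,570.
In hex that is 0x4E175F4A, which is still positive as a 32-bit int but is more than half of the
int maximum, so doubling it (or adding a similar amount) would wrap to a negative value. Might
some bit of code be using an int when it should use a long?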
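
The int-versus-long question can be checked with a few lines of Java. The sketch below uses illustrative names that are not from Drill. It prints the value in hex, tests whether it fits in a signed 32-bit int, and shows what int arithmetic does when a value of this size is doubled. Doubling is only a hypothetical example of where an int could go wrong, not a known code path.

```java
// Illustrative check of 32-bit behavior; names are hypothetical.
final class IntOverflowSketch {

    // True if the value can be stored in a signed 32-bit int without loss.
    static boolean fitsInInt(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    // Doubles the value using int arithmetic, which wraps silently on overflow.
    static int doubleAsInt(long value) {
        int narrowed = (int) value;
        return narrowed * 2;
    }

    // Doubles the value using long arithmetic, which has room for the result.
    static long doubleAsLong(long value) {
        return value * 2;
    }

    public static void main(String[] args) {
        long observed = 1_310_154_570L;
        System.out.println("hex            = 0x" + Long.toHexString(observed));
        System.out.println("fits in int    = " + fitsInInt(observed));
        System.out.println("doubled (int)  = " + doubleAsInt(observed));
        System.out.println("doubled (long) = " + doubleAsLong(observed));
    }
}
```

The value itself fits in an int. Doubled as an int it wraps to -1,674,658,156, while the same computation as a long gives 2,620,309,140.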
