drill-dev mailing list archives

From "Timothy Farkas (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-6468) CatastrophicFailure.exit Should Not Call System.exit
Date Tue, 05 Jun 2018 20:48:00 GMT
Timothy Farkas created DRILL-6468:

             Summary: CatastrophicFailure.exit Should Not Call System.exit
                 Key: DRILL-6468
                 URL: https://issues.apache.org/jira/browse/DRILL-6468
             Project: Apache Drill
          Issue Type: Bug
            Reporter: Timothy Farkas
            Assignee: Timothy Farkas

Drill may never terminate in the event of a Heap OOM. When this happens we see stack traces
like the following:

"250387a7-363d-619c-d745-57ae50f19d15:frag:0:0" #104 daemon prio=10 os_prio=0 tid=0x00007fd9d1eec190 nid=0xd7d5 in Object.wait() [0x00007fd953de2000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1252)
        - locked <0x00000005c06bee28> (a org.apache.drill.exec.server.Drillbit$ShutdownThread)
        at java.lang.Thread.join(Thread.java:1326)
        at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:106)
        at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
        at java.lang.Shutdown.runHooks(Shutdown.java:123)
        at java.lang.Shutdown.sequence(Shutdown.java:167)
        at java.lang.Shutdown.exit(Shutdown.java:212)
        - locked <0x00000005c1d8bb28> (a java.lang.Class for java.lang.Shutdown)
        at java.lang.Runtime.exit(Runtime.java:109)
        at java.lang.System.exit(System.java:971)
        at org.apache.drill.common.CatastrophicFailure.exit(CatastrophicFailure.java:49)
        at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:246)
        at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Here CatastrophicFailure.exit is called when we encounter a Heap OOM, and it in turn calls System.exit
to terminate the Java process. The problem is that System.exit runs Drill's normal shutdown
hook, which attempts a graceful shutdown. In the case of a Heap OOM we cannot do this reliably
because there isn't enough heap memory left to keep executing our code. The JVM likely
gets stuck at various places waiting on garbage collection and object allocations on the heap,
and the Drillbit stops making progress.
*Improving Drill's Behavior*

*Solution To Hanging Shutdown*

There are two kinds of OutOfMemory exceptions in Drill: Direct Memory OOMs and Heap OOMs.
Direct Memory OOMs are typically recoverable because Drill uses Direct Memory only to store data,
so we can fail the query, discard its data, and recover. Heap OOMs are unrecoverable because
we need the Heap to execute our code, and if we can't allocate on the heap then we
can't run our code reliably.
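
As a rough illustration of the distinction above, the following sketch maps a failure to the action it calls for. It is not Drill's code: the class OomPolicy, the exception DirectMemoryExhausted, and the Action enum are hypothetical names made up for this example. It also simplifies by treating every java.lang.OutOfMemoryError as a Heap OOM, even though the JVM can raise that error for other reasons too.

```java
public class OomPolicy {

    /** Hypothetical stand-in for an allocator exception raised when Direct Memory runs out. */
    static class DirectMemoryExhausted extends RuntimeException {
        DirectMemoryExhausted(String message) {
            super(message);
        }
    }

    /** What the caller should do in response to a failure (hypothetical). */
    enum Action { FAIL_QUERY, TERMINATE_PROCESS, RETHROW }

    /**
     * Direct Memory only holds query data, so the query can be failed and the
     * process keeps running. A Heap OOM means we can no longer run code
     * reliably, so the process has to go.
     */
    static Action classify(Throwable t) {
        if (t instanceof DirectMemoryExhausted) {
            return Action.FAIL_QUERY;
        }
        if (t instanceof OutOfMemoryError) {
            // Simplification: assume any OutOfMemoryError is a Heap OOM.
            return Action.TERMINATE_PROCESS;
        }
        return Action.RETHROW;
    }

    public static void main(String[] args) {
        System.out.println(classify(new DirectMemoryExhausted("buffer allocation failed")));
        System.out.println(classify(new OutOfMemoryError("Java heap space")));
        System.out.println(classify(new IllegalStateException("unrelated")));
    }
}
```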

When Drill experiences a catastrophic failure we should not call System.exit, because System.exit
runs the shutdown hooks and attempts a graceful shutdown. In the event of a catastrophic failure like a Heap OOM
we cannot recover, so we should forcefully terminate the JVM with Runtime.getRuntime().halt.

This will make Drill shut down promptly in the event of a Heap OOM.
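
A minimal sketch of what such a forced exit could look like is below. It is not the actual CatastrophicFailure implementation or the eventual patch: the class name ForcedExit, the replaceable terminator field, and the setTerminator method are assumptions introduced here so the behavior can be exercised without killing the JVM. The only real API relied on is Runtime.getRuntime().halt(int), which terminates the JVM without running shutdown hooks.

```java
import java.util.function.IntConsumer;

public class ForcedExit {

    // Default terminator: halt() ends the JVM immediately and, unlike
    // System.exit, does not run registered shutdown hooks.
    private static volatile IntConsumer terminator =
            code -> Runtime.getRuntime().halt(code);

    /** Hypothetical test seam: lets a test observe the exit code instead of halting. */
    static void setTerminator(IntConsumer replacement) {
        terminator = replacement;
    }

    /**
     * Reports the failure on a best-effort basis, then terminates.
     * Logging may itself fail when the heap is exhausted, so the
     * terminator is invoked from a finally block.
     */
    static void exit(Throwable cause, String message, int code) {
        try {
            System.err.println("Catastrophic failure: " + message);
            cause.printStackTrace();
        } catch (Throwable ignored) {
            // Nothing useful can be done here; fall through to termination.
        } finally {
            terminator.accept(code);
        }
    }
}
```

With the default terminator, calling ForcedExit.exit(error, "heap OOM", 1) from a fragment thread would end the process without joining a shutdown hook thread, which is the wait shown in the stack trace above.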
