uima-dev mailing list archives

From "Jerry Cwiklik (JIRA)" <...@uima.apache.org>
Subject [jira] [Commented] (UIMA-5794) DUCC: Agent fails to stop processes
Date Fri, 15 Jun 2018 15:00:00 GMT

    [ https://issues.apache.org/jira/browse/UIMA-5794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513927#comment-16513927 ]

Jerry Cwiklik commented on UIMA-5794:

I was able to reproduce the problem. It looks like a FailedInitialization in a POP (plain old
process) exposes a bug in the agent. When initialization fails and the POP keeps running,
the agent does not kill the process. After a while, the RM decides to purge the process.
The agent then receives a new state from the OR and finds that its inventory contains a
running process that is not in the OR state.

The JP is immune to this bug since it calls System.exit() when an exception occurs during
initialization. The agent's job is to clean up processes, which it clearly fails to do for POPs.
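
The inventory check described above can be sketched as a set difference. This is only an
illustration: InventoryReconciler, findOrphans, and the two collections are hypothetical
names, not DUCC classes, and the sketch assumes the agent inventory maps DUCC IDs to PIDs
for processes it believes are still alive.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of DUCC: compares what the agent is running
// against what the OR state says should exist.
final class InventoryReconciler {

    /**
     * Returns the DUCC IDs present in the agent inventory but absent from
     * the OR state. These are the candidates for a forced stop.
     *
     * @param agentInventory DUCC ID -> PID of processes the agent believes are alive
     * @param orState        DUCC IDs the OR currently knows about
     */
    static Set<Long> findOrphans(Map<Long, Long> agentInventory, Set<Long> orState) {
        Set<Long> orphans = new TreeSet<>(agentInventory.keySet());
        orphans.removeAll(orState);
        return orphans;
    }
}
```

Any ID returned here belongs to a process the agent still holds after the RM has purged it,
which is the situation this issue describes.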

> DUCC: Agent fails to stop processes
> -----------------------------------
>                 Key: UIMA-5794
>                 URL: https://issues.apache.org/jira/browse/UIMA-5794
>             Project: UIMA
>          Issue Type: Bug
>          Components: DUCC
>            Reporter: Jerry Cwiklik
>            Assignee: Jerry Cwiklik
>            Priority: Major
>             Fix For: 2.2.3-Ducc
> The agent sometimes does not stop running processes. In one specific case, the agent left a
> few processes running even though their state was set to Stopping.
> [Process Type=Pop DUCC ID=348 PID=17099 State=Stopping Resident Memory=361656320 GC Total=-1 GC Time=-1 Init Stats List Size:0 Reason: JPHasNoActiveJob] Exit Code=0
> [Process Type=Pop DUCC ID=364 PID=593 State=Stopping Resident Memory=7382974464 GC Total=-1 GC Time=-1 Init Stats List Size:0 Reason: JPHasNoActiveJob] Exit Code=0
> For some reason the agent failed to send SIGKILL after SIGTERM failed to stop them. Since
> these processes used a lot of memory, the OS out-of-memory killer ended up killing legitimate
> processes to keep the node from running out of memory.
> Since the agent logs wrapped, the evidence of what happened has been lost.
> Modify the agent to keep sending SIGKILL to processes in the Stopping state after some time
> elapses. Perhaps the rogue process detector can be tasked with that.
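
A minimal sketch of the proposed escalation (SIGTERM first, then repeated SIGKILL while the
process is still alive), assuming Java 9+ and its ProcessHandle API. StopEscalator, Target,
and the timing parameters are hypothetical and are not taken from the DUCC agent code. On
Linux, ProcessHandle.destroy() and destroyForcibly() correspond to SIGTERM and SIGKILL;
other platforms may behave differently.

```java
import java.time.Duration;

// Hypothetical helper, not part of DUCC.
final class StopEscalator {

    /** Minimal view of a process so the loop can be exercised without a real PID. */
    interface Target {
        boolean isAlive();
        void terminate();  // polite request (SIGTERM on Linux)
        void kill();       // forced stop (SIGKILL on Linux)
    }

    /** Adapter over the JDK's ProcessHandle. */
    static Target of(ProcessHandle h) {
        return new Target() {
            public boolean isAlive() { return h.isAlive(); }
            public void terminate() { h.destroy(); }
            public void kill() { h.destroyForcibly(); }
        };
    }

    /**
     * Sends terminate, waits up to {@code grace}, then sends kill every
     * {@code retry} interval until the process exits or {@code maxKills}
     * is reached. Returns the number of kill signals sent.
     */
    static int stop(Target t, Duration grace, Duration retry, int maxKills) {
        if (!t.isAlive()) return 0;
        t.terminate();
        if (waitForExit(t, grace)) return 0;
        int kills = 0;
        while (kills < maxKills && !Thread.currentThread().isInterrupted()) {
            t.kill();
            kills++;
            if (waitForExit(t, retry)) break;
        }
        return kills;
    }

    /** Polls until the process exits or the timeout elapses; true if it exited. */
    private static boolean waitForExit(Target t, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (t.isAlive()) {
            if (System.nanoTime() >= deadline) return false;
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // preserve the interrupt for the caller
                return !t.isAlive();
            }
        }
        return true;
    }
}
```

For a real process, a caller could look up the handle with ProcessHandle.of(pid) and pass
StopEscalator.of(handle) to stop(). Bounding the retries with maxKills is a design choice of
this sketch; a periodic task such as the rogue process detector could instead re-run the
check on every cycle.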

This message was sent by Atlassian JIRA
