uima-dev mailing list archives

From "Jim Challenger (JIRA)" <...@uima.apache.org>
Subject [jira] [Commented] (UIMA-2593) RM: Resource Manager mishandling dead node with Work Items in Limbo
Date Tue, 28 May 2013 19:16:20 GMT

    [ https://issues.apache.org/jira/browse/UIMA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668574#comment-13668574 ]

Jim Challenger commented on UIMA-2593:

To reproduce: Start a job with about 1000 20-second work items, 52 nodes, 8 threads per
process.  The last 20-30 work items are quite long to ensure the job doesn't end before the
problem shows up.  Run this job and wait for it to run down and start returning nodes.  Then
issue SIGSTOP to the Agent on one of the live nodes with a process on it.  RM will see it
go dead and allocate a new node to make up for it.  RM then allocates a new node pretty much
every scheduling cycle until the pool is exhausted.  Not good!
> RM: Resource Manager mishandling dead node with Work Items in Limbo
> -------------------------------------------------------------------
>                 Key: UIMA-2593
>                 URL: https://issues.apache.org/jira/browse/UIMA-2593
>             Project: UIMA
>          Issue Type: Bug
>          Components: DUCC
>            Reporter: Jim Challenger
>            Assignee: Jim Challenger
>             Fix For: 1.0-Ducc
> If a node dies with a work-item that is starting but not confirmed so it goes into Limbo,
> RM continuously allocates a new node until the pool is exhausted.
> Correct behavior is for RM to allocate only sufficient nodes to make up for the dead
> one, based on remaining work.
> To reproduce, start a small cluster and fire off a job with a couple hundred short (5-10
> second) work items.  Once all nodes are full issue SIGSTOP to one agent and JP.  This should
> cause at least one WI to go into limbo.  When the heartbeat counter says the node is dead
> we expect to see the errant behavior start.
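
The "correct behavior" described above (allocate only enough to make up for the dead
node, based on remaining work) can be illustrated with a rough sketch.  This is not
the DUCC RM code.  The class NodeDeficit, its method names, and the assumption of one
process per node are all hypothetical, chosen only to show the arithmetic.  Work items
in Limbo are assumed to be counted as remaining work.

```java
// Hypothetical illustration only: not taken from the DUCC Resource Manager.
// Assumes one job process per node and that Limbo work items count as remaining work.
final class NodeDeficit {
    private NodeDeficit() {}

    // Number of processes the remaining work can keep busy (ceiling division).
    public static int processesWanted(int remainingWorkItems, int threadsPerProcess) {
        if (threadsPerProcess <= 0) {
            throw new IllegalArgumentException("threadsPerProcess must be positive");
        }
        if (remainingWorkItems <= 0) {
            return 0;
        }
        return (remainingWorkItems + threadsPerProcess - 1) / threadsPerProcess;
    }

    // Processes still to be allocated, given how many are alive right now.
    // Never negative: surplus processes are not a reason to allocate more.
    public static int additionalProcesses(int remainingWorkItems,
                                          int threadsPerProcess,
                                          int liveProcesses) {
        int wanted = processesWanted(remainingWorkItems, threadsPerProcess);
        return Math.max(0, wanted - liveProcesses);
    }

    public static void main(String[] args) {
        // 25 work items left, 8 threads per process: 4 processes wanted.
        // One of 4 nodes has just been declared dead, so 3 are live.
        System.out.println(additionalProcesses(25, 8, 3)); // prints 1
        // Next scheduling cycle, after the replacement is counted as live.
        System.out.println(additionalProcesses(25, 8, 4)); // prints 0
    }
}
```

In this sketch the deficit drops to zero once the replacement is counted, so later
scheduling cycles request nothing more, in contrast to the once-per-cycle allocation
reported here.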

