uima-dev mailing list archives

From "Jim Challenger (JIRA)" <...@uima.apache.org>
Subject [jira] [Created] (UIMA-2772) DUCC resource manager - Restart and fast-start
Date Mon, 25 Mar 2013 18:55:16 GMT
Jim Challenger created UIMA-2772:
------------------------------------

             Summary: DUCC resource manager - Restart and fast-start
                 Key: UIMA-2772
                 URL: https://issues.apache.org/jira/browse/UIMA-2772
             Project: UIMA
          Issue Type: Bug
          Components: DUCC
            Reporter: Jim Challenger
            Assignee: Jim Challenger


Currently RM waits a "reasonable time" (init-stability) on startup to allow nodes to check
in before it accepts scheduling requests.  It is not possible to know exactly how long to
wait, which makes init-stability a heuristic.  For normal startup this is not a problem.
But if RM is restarting 'hot', or if the orchestrator publishes non-preemptable jobs on
restart, and the necessary nodes have not arrived by the end of the init-stability wait,
the result can be over-commitment, under-commitment, and in some cases inconsistent state
(and crashes).

To remedy this, RM will include the full Node object in its publications to the orchestrator
(OR), which will echo the Node objects back for work that it believes to be active.  On
startup RM can then fully reconstruct its state as of its last publication from the echoed
objects, eliminating the problem.  A side effect is that RM need not wait for nodes to check
in, which significantly decreases its startup time.  If nodes added to the resource pool in
this way never check in, the normal "dead node" mechanism will kick in, maintaining
consistency.
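
The proposal can be illustrated with a minimal sketch.  This is not DUCC code: the names
Node, ResourceManager, restore_from_echo, node_checked_in and expire_dead_nodes are
hypothetical, and the "dead node" mechanism is reduced to a single sweep rather than the
heartbeat timeout a real RM would presumably use.  It only shows the idea that state is
rebuilt from echoed Node objects and that restored nodes which never check in are evicted.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for the full Node object RM publishes."""
    name: str
    memory_gb: int


@dataclass
class ResourceManager:
    """Illustrative only; not the DUCC RM implementation."""
    nodes: dict = field(default_factory=dict)        # node name -> Node
    assignments: dict = field(default_factory=dict)  # node name -> job ids
    unconfirmed: set = field(default_factory=set)    # restored, not yet checked in

    def restore_from_echo(self, echoed):
        """Rebuild state from (job_id, Node) pairs echoed back by the OR."""
        for job_id, node in echoed:
            self.nodes[node.name] = node
            self.assignments.setdefault(node.name, []).append(job_id)
            self.unconfirmed.add(node.name)

    def node_checked_in(self, node):
        """Record a live check-in; the node is no longer at risk of eviction."""
        self.nodes[node.name] = node
        self.assignments.setdefault(node.name, [])
        self.unconfirmed.discard(node.name)

    def expire_dead_nodes(self):
        """Drop restored nodes that never checked in; return the evicted names."""
        dead = sorted(self.unconfirmed)
        for name in dead:
            self.nodes.pop(name, None)
            self.assignments.pop(name, None)
        self.unconfirmed.clear()
        return dead


# RM restarts, restores two nodes from the echo, and only n1 checks in.
rm = ResourceManager()
rm.restore_from_echo([("job-1", Node("n1", 48)), ("job-2", Node("n2", 48))])
rm.node_checked_in(Node("n1", 48))
print(rm.expire_dead_nodes())  # prints ['n2']
```

In this sketch RM can accept scheduling requests immediately after restore_from_echo,
because the restored assignments already account for work the OR believes to be active.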

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
