ignite-dev mailing list archives

From "Alexey Goncharuk (JIRA)" <j...@apache.org>
Subject [jira] [Created] (IGNITE-3616) Drop failed nodes from topology after a configured timeout
Date Mon, 01 Aug 2016 07:21:20 GMT
Alexey Goncharuk created IGNITE-3616:
----------------------------------------

             Summary: Drop failed nodes from topology after a configured timeout
                 Key: IGNITE-3616
                 URL: https://issues.apache.org/jira/browse/IGNITE-3616
             Project: Ignite
          Issue Type: Improvement
          Components: cache
    Affects Versions: 1.5.0.final
            Reporter: Alexey Goncharuk


If an OOME or assertion error occurs on a node, partition exchange often gets stuck,
blocking the whole cluster. We should provide a mechanism to drop non-responsive nodes automatically.

When partition exchange times out, the coordinator should:
- print out IDs/IPs of non-responsive nodes at all times
- introduce a certain kill timeout for non-responsive nodes (-1 means
disabled)
- the timeout should be at least a minute after the 1st non-responsive node
message is printed
- when the timeout expires, we should kill the nodes and automatically
collect their thread dumps (do best effort for it)
- we should print out a message asking users to provide these thread dumps to us via Jira
or dev list
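
The steps above could be sketched roughly as follows. This is only an illustration of the proposed bookkeeping, not Ignite code: the class ExchangeWatchdog, the method onExchangeTimeout and the constant MIN_KILL_DELAY_MS are hypothetical names. The sketch also assumes that "at least a minute" means the configured kill timeout may not be shorter than 60 seconds. Actually stopping the nodes and collecting their thread dumps is left to the caller.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the coordinator-side bookkeeping; not part of the Ignite API. */
final class ExchangeWatchdog {
    /** Assumed reading of the issue: never kill sooner than one minute after the first warning. */
    static final long MIN_KILL_DELAY_MS = 60_000L;

    /** Kill timeout in milliseconds; -1 disables automatic kills. */
    private final long killTimeoutMs;

    /** Node ID -> time at which the node was first reported as non-responsive. */
    private final Map<String, Long> firstReportedAt = new LinkedHashMap<>();

    ExchangeWatchdog(long killTimeoutMs) {
        if (killTimeoutMs != -1 && killTimeoutMs < MIN_KILL_DELAY_MS)
            throw new IllegalArgumentException("Kill timeout must be -1 or at least " + MIN_KILL_DELAY_MS + " ms");
        this.killTimeoutMs = killTimeoutMs;
    }

    /**
     * Called whenever the coordinator finds that partition exchange has timed out.
     *
     * @param nonResponsive node ID -> IP address of nodes that have not responded yet
     * @param nowMs current time in milliseconds
     * @return IDs of nodes whose kill timeout has expired
     */
    List<String> onExchangeTimeout(Map<String, String> nonResponsive, long nowMs) {
        // Forget nodes that have responded since the previous check.
        firstReportedAt.keySet().retainAll(nonResponsive.keySet());

        List<String> toKill = new ArrayList<>();
        for (Map.Entry<String, String> e : nonResponsive.entrySet()) {
            // Always print IDs/IPs, even when automatic kills are disabled.
            System.out.println("Non-responsive node [id=" + e.getKey() + ", ip=" + e.getValue() + "]");

            long first = firstReportedAt.computeIfAbsent(e.getKey(), k -> nowMs);
            if (killTimeoutMs != -1 && nowMs - first >= killTimeoutMs)
                toKill.add(e.getKey());
        }

        for (String id : toKill) {
            firstReportedAt.remove(id);
            // The caller is expected to collect a thread dump (best effort) and then stop the node.
            System.out.println("Kill timeout expired for node " + id
                + "; please send its thread dump via Jira or the dev list");
        }
        return toKill;
    }
}
```

Passing the current time in as a parameter keeps the sketch deterministic. A real implementation would more likely rely on the coordinator's own clock and exchange timeout handling.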



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
