ignite-dev mailing list archives

From "Vyacheslav Koptilin (JIRA)" <j...@apache.org>
Subject [jira] [Created] (IGNITE-11253) When a node that is not part of the base topology joins the cluster, it may lead to a node failure.
Date Thu, 07 Feb 2019 23:01:00 GMT
Vyacheslav Koptilin created IGNITE-11253:
--------------------------------------------

             Summary: When a node that is not part of the base topology joins the cluster,
it may lead to a node failure.
                 Key: IGNITE-11253
                 URL: https://issues.apache.org/jira/browse/IGNITE-11253
             Project: Ignite
          Issue Type: Bug
    Affects Versions: 2.7
            Reporter: Vyacheslav Koptilin
            Assignee: Vyacheslav Koptilin
             Fix For: 2.8


 * If eager TTL is configured, a starting node creates and starts the {{cleanupWorker}}
(see {{GridCacheTtlManager.start0()}}).
 * {{GridCacheSharedTtlCleanupManager.CleanupWorker}}, in turn, has to wait for the {{discovery().localJoin()}}
future, which is completed by the discovery thread.
 * On the other hand, the exchange thread stops cache contexts and therefore stops the
{{cleanupWorker}} as well.

 
{code:java}
org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager.stopCleanupWorker(GridCacheSharedTtlCleanupManager.java:109)
org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager.unregister(GridCacheSharedTtlCleanupManager.java:82)
org.apache.ignite.internal.processors.cache.GridCacheTtlManager.onKernalStop0(GridCacheTtlManager.java:110)
org.apache.ignite.internal.processors.cache.GridCacheManagerAdapter.onKernalStop(GridCacheManagerAdapter.java:111)
org.apache.ignite.internal.processors.cache.GridCacheProcessor.onKernalStop(GridCacheProcessor.java:1495)
org.apache.ignite.internal.processors.cache.GridCacheProcessor.onKernalStopCaches(GridCacheProcessor.java:1182)
org.apache.ignite.internal.processors.cache.GridCacheProcessor$CacheRecoveryLifecycle.onBaselineChange(GridCacheProcessor.java:5637)
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.initCachesOnLocalJoin(GridDhtPartitionsExchangeFuture.java:910)
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:792)

{code}
So the exchange thread may try to stop the {{cleanupWorker}} before the {{localJoin}} future
is completed by the discovery thread.

Unfortunately, {{cleanupWorker}} handles this situation incorrectly. The interrupted wait is
treated as a critical system error, which can lead to a node failure:
{code:java}
Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=class o.a.i.IgniteException: Got interrupted while waiting for future to complete.]]
class org.apache.ignite.IgniteException: Got interrupted while waiting for future to complete.
at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.localJoin(GridDiscoveryManager.java:2217)
at org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:136)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
at java.lang.Thread.run(Thread.java:748)
Caused by: class org.apache.ignite.internal.IgniteInterruptedCheckedException: Got interrupted while waiting for future to complete.
at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:186)
at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.localJoin(GridDiscoveryManager.java:2214)
... 3 more
{code}
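
The race described above can be reproduced in miniature with plain {{java.util.concurrent}} classes. The sketch below is only an illustration under stated assumptions: it uses no Ignite classes, it is not the actual fix for this issue, and the names {{CleanupWorkerSketch}}, {{Worker}}, {{Outcome}}, {{tolerateCancel}} and {{stopBeforeJoin}} are hypothetical. A worker blocks on a "local join" future that is never completed, another thread cancels it, and the worker either reports the interruption as a critical failure or, if it checks its own cancellation flag, exits quietly.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.atomic.AtomicReference;

public class CleanupWorkerSketch {
    /** Possible results of the worker body (hypothetical, for illustration only). */
    public enum Outcome { JOINED, STOPPED_QUIETLY, CRITICAL_FAILURE }

    /** Hypothetical stand-in for a cleanup worker that waits for the local join. */
    static final class Worker implements Runnable {
        private final CompletableFuture<Void> localJoin;
        private final boolean tolerateCancel;
        private final CountDownLatch started = new CountDownLatch(1);
        private final AtomicReference<Outcome> outcome = new AtomicReference<>();
        private volatile boolean cancelled;
        private volatile Thread runner;

        Worker(CompletableFuture<Void> localJoin, boolean tolerateCancel) {
            this.localJoin = localJoin;
            this.tolerateCancel = tolerateCancel;
        }

        @Override public void run() {
            runner = Thread.currentThread();
            started.countDown();
            try {
                localJoin.get(); // Blocks until the "discovery thread" completes the future.
                outcome.set(Outcome.JOINED);
            } catch (InterruptedException e) {
                // An interrupt caused by our own cancellation is an expected stop,
                // not a failure; anything else is still reported as critical.
                boolean expected = tolerateCancel && cancelled;
                outcome.set(expected ? Outcome.STOPPED_QUIETLY : Outcome.CRITICAL_FAILURE);
            } catch (ExecutionException e) {
                outcome.set(Outcome.CRITICAL_FAILURE);
            }
        }

        /** Plays the role of the exchange thread stopping the worker. */
        void cancel() throws InterruptedException {
            started.await();   // Make sure the worker thread is known.
            cancelled = true;  // Set the flag before interrupting.
            runner.interrupt();
        }

        Outcome outcome() { return outcome.get(); }
    }

    /** Stops the worker before the join future completes and returns what the worker reported. */
    public static Outcome stopBeforeJoin(boolean tolerateCancel) {
        CompletableFuture<Void> localJoin = new CompletableFuture<>(); // Never completed here.
        Worker worker = new Worker(localJoin, tolerateCancel);
        Thread thread = new Thread(worker, "ttl-cleanup-worker-sketch");
        thread.start();
        try {
            worker.cancel();
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return worker.outcome();
    }

    public static void main(String[] args) {
        System.out.println(stopBeforeJoin(false)); // prints CRITICAL_FAILURE
        System.out.println(stopBeforeJoin(true));  // prints STOPPED_QUIETLY
    }
}
```

With {{tolerateCancel=false}} the sketch mirrors the reported behavior: a legitimate stop surfaces as a failure. With {{tolerateCancel=true}} the same interrupt is recognized as a requested stop. Whether the real worker should check a cancellation flag, or avoid the blocking wait altogether, is left open here.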
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
