lucene-dev mailing list archives

From "Shalin Shekhar Mangar (JIRA)" <j...@apache.org>
Subject [jira] [Reopened] (SOLR-8069) Ensure that only the valid ZooKeeper registered leader can put a replica into Leader Initiated Recovery.
Date Thu, 22 Oct 2015 07:04:27 GMT

     [ https://issues.apache.org/jira/browse/SOLR-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar reopened SOLR-8069:
-----------------------------------------

There's a reproducible failure in the test added by SOLR-8075, caused by an
AssertionError from the asserts added in this issue.

{code}
1 tests failed.
FAILED:  org.apache.solr.cloud.LeaderInitiatedRecoveryOnShardRestartTest.testRestartWithAllInLIR

Error Message:
Captured an uncaught exception in thread: Thread[id=43491, name=coreZkRegister-5997-thread-1, state=RUNNABLE, group=TGRP-LeaderInitiatedRecoveryOnShardRestartTest]

Stack Trace:
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=43491, name=coreZkRegister-5997-thread-1, state=RUNNABLE, group=TGRP-LeaderInitiatedRecoveryOnShardRestartTest]
Caused by: java.lang.AssertionError
        at __randomizedtesting.SeedInfo.seed([7F78F76DDF75FAD1]:0)
        at org.apache.solr.cloud.ZkController.updateLeaderInitiatedRecoveryState(ZkController.java:2133)
        at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:434)
        at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:197)
        at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:157)
        at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:346)
        at org.apache.solr.cloud.ZkController.joinElection(ZkController.java:1113)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:926)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:881)
        at org.apache.solr.core.ZkContainer$2.run(ZkContainer.java:183)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
{code}

The assertion that leaderCd != null fails because ShardLeaderElectionContext.runLeaderProcess
calls ZkController.updateLeaderInitiatedRecoveryState with a null core descriptor. Passing
null there is by design: if you are marking a replica as 'active', you don't necessarily
need to be the leader.
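
To make the mismatch concrete, here is a minimal standalone sketch. It is not Solr code:
the class, enum and method names are hypothetical, only the leaderCd name comes from the
assertion above, and the "relaxed" check is just one possible way to scope the assert, not
necessarily the fix that will be committed. It throws AssertionError explicitly so it
behaves the same with or without -ea.

```java
// Hypothetical illustration only; none of these names are Solr's actual API.
public class LirAssertSketch {
    enum State { DOWN, ACTIVE }

    // Strict check: demands a leader core descriptor for every state update.
    // This mirrors the failure mode described above.
    static void updateStrict(State state, String leaderCd) {
        if (leaderCd == null) {
            throw new AssertionError("leaderCd must not be null");
        }
    }

    // Relaxed check (an assumption): only putting a replica into LIR (DOWN)
    // requires the leader's descriptor; marking it ACTIVE does not.
    static void updateRelaxed(State state, String leaderCd) {
        if (state == State.DOWN && leaderCd == null) {
            throw new AssertionError("leaderCd required to put a replica into LIR");
        }
    }

    // Helper: reports whether the given call trips an AssertionError.
    static boolean fails(Runnable call) {
        try {
            call.run();
            return false;
        } catch (AssertionError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Marking 'active' with a null descriptor trips the strict check.
        System.out.println(fails(() -> updateStrict(State.ACTIVE, null)));  // prints "true"
        // The relaxed check lets the same call through.
        System.out.println(fails(() -> updateRelaxed(State.ACTIVE, null))); // prints "false"
        // It still rejects a null descriptor when putting a replica into LIR.
        System.out.println(fails(() -> updateRelaxed(State.DOWN, null)));   // prints "true"
    }
}
```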

> Ensure that only the valid ZooKeeper registered leader can put a replica into Leader Initiated Recovery.
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-8069
>                 URL: https://issues.apache.org/jira/browse/SOLR-8069
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>            Priority: Critical
>             Fix For: 5.4, Trunk
>
>         Attachments: SOLR-8069.patch, SOLR-8069.patch
>
>
> I've seen this twice now. Need to work on a test.
> When some issues hit all the replicas at once, you can end up in a situation where the rightful leader was put or put itself into LIR. Even on restart, this rightful leader won't take leadership and you have to manually clear the LIR nodes.
> It seems that if all the replicas participate in election on startup, LIR should just be cleared.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


