lucene-dev mailing list archives

From "Mark Miller (JIRA)" <>
Subject [jira] [Commented] (SOLR-10889) Stale zookeper information is used during failover check
Date Tue, 03 Oct 2017 15:34:00 GMT


Mark Miller commented on SOLR-10889:

SOLR-10397 has not landed yet, so we should probably get the current implementation back into
shape, if only to make sure the current tests are in good shape.

> Stale zookeper information is used during failover check
> --------------------------------------------------------
>                 Key: SOLR-10889
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 7.0
>            Reporter: Mihaly Toth
>            Assignee: Mark Miller
>         Attachments: SOLR-10889.patch
> {{OverseerAutoReplicaFailoverThread}} goes over every replica to check whether it needs
to be reloaded on a new node. In each such round it reads the cluster state only once, at
the beginning. Especially in big clusters, the cluster state may change while the replicas
are being iterated. As a result, wrong decisions may be made: restarting a healthy core, or
not handling a bad node.
> The code fragment in question:
> {code}
>         for (Slice slice : slices) {
>           if (slice.getState() == Slice.State.ACTIVE) {
>             final Collection<DownReplica> downReplicas = new ArrayList<DownReplica>();
>             int goodReplicas = findDownReplicasInSlice(clusterState, docCollection, slice, downReplicas);
> {code}
> The solution seems rather straightforward: read the cluster state again for every slice:
> {code}
>             int goodReplicas = findDownReplicasInSlice(zkStateReader.getClusterState(), docCollection, slice, downReplicas);
> {code}
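To make the difference concrete, here is a minimal, self-contained sketch of the two read strategies. It does not use Solr classes: the "cluster state" is assumed to be a plain map from replica name to liveness, and `FailoverCheckSketch`, `stateReader`, and `clusterChange` are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for the failover check; not the Solr implementation.
final class FailoverCheckSketch {

  /** Reads the state once and reuses that snapshot for every slice. */
  static List<String> downWithSingleSnapshot(
      Supplier<Map<String, Boolean>> stateReader,
      List<List<String>> slices,
      Runnable clusterChange) {
    Map<String, Boolean> snapshot = stateReader.get(); // single read up front
    List<String> down = new ArrayList<>();
    for (List<String> slice : slices) {
      collectDown(snapshot, slice, down);
      clusterChange.run(); // simulates the cluster changing mid-iteration
    }
    return down;
  }

  /** Re-reads the state for every slice, as proposed above. */
  static List<String> downWithFreshState(
      Supplier<Map<String, Boolean>> stateReader,
      List<List<String>> slices,
      Runnable clusterChange) {
    List<String> down = new ArrayList<>();
    for (List<String> slice : slices) {
      collectDown(stateReader.get(), slice, down); // fresh read per slice
      clusterChange.run();
    }
    return down;
  }

  // A replica missing from the map is treated as down in this sketch.
  private static void collectDown(
      Map<String, Boolean> state, List<String> slice, List<String> down) {
    for (String replica : slice) {
      if (!state.getOrDefault(replica, false)) {
        down.add(replica);
      }
    }
  }
}
```

If a replica in the second slice recovers while the first slice is being checked, the single-snapshot variant still reports it as down, while the per-slice variant does not.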
> The only counterargument that comes to mind is that the cluster state would be read too
frequently. We could refine this naive solution so that the state is re-read only when a bad
node is found, but I am not sure such a read optimization is necessary.
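The read optimization mentioned above could look roughly like the sketch below. It is only an illustration under the same simplifying assumption (cluster state as a map from replica name to liveness); `LazyRereadSketch` and `confirmedDown` are hypothetical names, not Solr APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch: refresh the cached state only when it reports a down replica.
final class LazyRereadSketch {

  static List<String> confirmedDown(
      Supplier<Map<String, Boolean>> stateReader, List<List<String>> slices) {
    Map<String, Boolean> state = stateReader.get(); // initial read
    List<String> confirmed = new ArrayList<>();
    for (List<String> slice : slices) {
      if (anyDown(state, slice)) {
        state = stateReader.get(); // re-read before acting on bad news
      }
      for (String replica : slice) {
        if (!state.getOrDefault(replica, false)) {
          confirmed.add(replica);
        }
      }
    }
    return confirmed;
  }

  private static boolean anyDown(Map<String, Boolean> state, List<String> slice) {
    for (String replica : slice) {
      if (!state.getOrDefault(replica, false)) {
        return true;
      }
    }
    return false;
  }
}
```

Note the trade-off: this avoids restarting a core that has already recovered, but a replica that went down after the last read is still seen as healthy until the next round, so it only addresses one of the two wrong decisions described in the issue.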
> I have written some unit tests around this class, mocking out even the time factor. They
run in a second. I am interested in getting feedback about such an approach. I will upload a
patch with this shortly.
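As an illustration of what mocking out the time factor can mean, the sketch below injects the clock instead of calling `System.nanoTime()` directly, so a test can advance time instantly. This is not the attached patch; `DownTimerSketch`, `shouldFailOver`, and the wait threshold are hypothetical names and assumptions made for the example.

```java
import java.util.function.LongSupplier;

// Hypothetical helper: decides whether a replica has been down long enough
// to act on, using an injected nanosecond clock.
final class DownTimerSketch {
  private final LongSupplier nanoClock;
  private final long waitNanos;
  private boolean down;
  private long downSinceNanos;

  DownTimerSketch(LongSupplier nanoClock, long waitNanos) {
    this.nanoClock = nanoClock;
    this.waitNanos = waitNanos;
  }

  /** Returns true once the replica has been continuously down for waitNanos. */
  boolean shouldFailOver(boolean replicaIsDown) {
    if (!replicaIsDown) {
      down = false; // recovery resets the timer
      return false;
    }
    long now = nanoClock.getAsLong();
    if (!down) {
      down = true;
      downSinceNanos = now; // first time we see it down
    }
    return now - downSinceNanos >= waitNanos;
  }
}
```

In production the clock would be `System::nanoTime`; in a test it can be an `AtomicLong::get` that the test advances by hand, which is why such tests do not need to sleep.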

This message was sent by Atlassian JIRA
