lucene-dev mailing list archives

From "Shalin Shekhar Mangar (JIRA)" <>
Subject [jira] [Updated] (SOLR-6530) Commits under network partition can put any node in down state by any node
Date Fri, 19 Sep 2014 06:12:35 GMT


Shalin Shekhar Mangar updated SOLR-6530:
    Attachment: SOLR-6530.patch

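Here's a better fix which checks the global (ZooKeeper) state instead of the local state
before executing the leader-initiated recovery (LIR) code. From my reading of the code,
the local isLeader variable in CloudDescriptor is not unset in all cases, so it cannot be
relied on to tell whether this node is still the leader.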
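
To illustrate the idea behind the fix, here is a minimal sketch and not the code in the attached patch. Every name in it (LirGuardSketch, publishLeader, setLocalFlag, the two mayInitiateRecovery methods) is hypothetical, and the two maps are stand-ins for the leader registry in ZooKeeper and for a node's cached isLeader flag. It only shows why a decision based on the global state rejects a node whose local flag is stale.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch; none of these types or methods are Solr's real classes. */
final class LirGuardSketch {
    /** Stand-in for the leader registry kept in ZooKeeper: shard id -> leader node name. */
    private final Map<String, String> globalLeaders = new ConcurrentHashMap<>();

    /** Stand-in for each node's cached isLeader flag, which may be stale. */
    private final Map<String, Boolean> localIsLeader = new ConcurrentHashMap<>();

    void publishLeader(String shard, String node) {
        globalLeaders.put(shard, node);
    }

    void setLocalFlag(String node, boolean isLeader) {
        localIsLeader.put(node, isLeader);
    }

    /** Decision based on the cached flag only: a stale flag lets a non-leader through. */
    boolean mayInitiateRecoveryLocal(String node) {
        return localIsLeader.getOrDefault(node, false);
    }

    /** Decision based on the global state: only the currently registered leader passes. */
    boolean mayInitiateRecoveryGlobal(String shard, String node) {
        return node.equals(globalLeaders.get(shard));
    }

    public static void main(String[] args) {
        LirGuardSketch sketch = new LirGuardSketch();
        sketch.publishLeader("shard1", "A"); // the global state says A leads shard1
        sketch.setLocalFlag("B", true);      // B's cached flag was never unset
        System.out.println("local check lets B run LIR:  " + sketch.mayInitiateRecoveryLocal("B"));
        System.out.println("global check lets B run LIR: " + sketch.mayInitiateRecoveryGlobal("shard1", "B"));
    }
}
```

In this sketch the local check lets B through while the global check does not, which is the distinction the fix relies on.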

> Commits under network partition can put any node in down state by any node
> --------------------------------------------------------------------------
>                 Key: SOLR-6530
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Shalin Shekhar Mangar
>            Priority: Critical
>             Fix For: 5.0, 6.0
>         Attachments: SOLR-6530.patch, SOLR-6530.patch, SOLR-6530.patch
>
> Commits are executed by any node in SolrCloud, i.e. they're not routed via the leader like other updates.
> # Suppose there's 1 collection, 1 shard, 2 replicas (A and B) and A is the leader
> # Suppose a commit request is made to node B during a time when B cannot talk to A due to a partition for any reason (failing switch, heavy GC, whatever)
> # B fails to distribute the commit to A (times out) and asks A to recover
> # This was okay earlier because a leader just ignores recovery requests, but with the leader-initiated recovery code, B puts A in the "down" state and A can never get out of that state.
> tl;dr: During network partitions, if enough commit/optimize requests are sent to the cluster, all the nodes in the cluster will eventually be marked as "down".
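
The scenario in the description can be played out with a toy model. This is only a sketch under stated assumptions: PartitionCommitSketch, its State enum, the onlyLeaderMayMarkDown switch and commitDuringPartition are all hypothetical names, and the timeout is assumed rather than simulated. It reproduces the failure mode (a non-leader marking the leader "down") and shows what changes when only the leader is allowed to do that.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical toy model of the reported scenario; not Solr code. */
final class PartitionCommitSketch {
    enum State { ACTIVE, DOWN }

    /** Published state of every replica in the shard. */
    final Map<String, State> states = new LinkedHashMap<>();
    final String leader;
    private final boolean onlyLeaderMayMarkDown;

    PartitionCommitSketch(String leader, boolean onlyLeaderMayMarkDown, String... nodes) {
        this.leader = leader;
        this.onlyLeaderMayMarkDown = onlyLeaderMayMarkDown;
        for (String node : nodes) {
            states.put(node, State.ACTIVE);
        }
    }

    /** A commit arrives at receiver while it is partitioned from every other replica. */
    void commitDuringPartition(String receiver) {
        for (Map.Entry<String, State> peer : states.entrySet()) {
            if (peer.getKey().equals(receiver)) {
                continue;
            }
            // Forwarding the commit to this peer is assumed to time out.
            boolean allowed = !onlyLeaderMayMarkDown || receiver.equals(leader);
            if (allowed) {
                peer.setValue(State.DOWN); // the receiver marks the unreachable peer "down"
            }
        }
    }

    public static void main(String[] args) {
        PartitionCommitSketch unguarded = new PartitionCommitSketch("A", false, "A", "B");
        unguarded.commitDuringPartition("B");
        System.out.println("without guard: " + unguarded.states);

        PartitionCommitSketch guarded = new PartitionCommitSketch("A", true, "A", "B");
        guarded.commitDuringPartition("B");
        System.out.println("with guard:    " + guarded.states);
    }
}
```

Without the guard, a commit sent to B leaves leader A marked DOWN; with the guard, A stays ACTIVE because B is not the leader.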
