lucene-dev mailing list archives

From <Stephane.Lagrau...@ext.cdiscount.com>
Subject SolrCloud issues
Date Mon, 01 Feb 2016 17:14:17 GMT
Hello,

We are currently running benchmarks on Solr 5.4.0 and we have hit several issues related to
SolrCloud that lead to recoveries and inconsistencies.
Based on our tests, this version seems less stable under pressure than the 4.10.4 version
we previously ran.
We were able to mitigate the effects by increasing numRecordsToKeep in the update log and
by limiting replication bandwidth (see the configuration sketch after this paragraph).
But not all problems were resolved, and, more worryingly, it is now harder to get the
cluster back to a healthy state.
For example, we ended up in a situation where, on one shard, the leader is down while all
replicas are active.
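
For reference, here is roughly what that mitigation looks like in solrconfig.xml. This is a
sketch rather than our exact configuration: the values are placeholders, and it assumes the
maxWriteMBPerSec replication throttle (added by SOLR-6485) is available and accepted at this
position in the handler configuration.

    <!-- Keep more records in the transaction log so that peers can catch up
         via PeerSync instead of falling back to full index replication.
         The default is 100; 500 is a placeholder value. -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
        <int name="numRecordsToKeep">500</int>
      </updateLog>
    </updateHandler>

    <!-- Throttle replication bandwidth (in MB per second) so recovering
         replicas do not saturate the network; 16 is a placeholder value. -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <str name="maxWriteMBPerSec">16</str>
    </requestHandler>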

We found a particular pattern that leads to a bad cluster state, described here: https://issues.apache.org/jira/browse/SOLR-8129?focusedCommentId=15119905&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15119905
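
To spot the state described above (leader down while replicas report active), a small SolrJ
check along the following lines can be run against the cluster state published in ZooKeeper.
This is a minimal sketch assuming the 5.x SolrJ API; the zkHost string and the collection
name "mycollection" are placeholders.

    import java.util.Set;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.cloud.ClusterState;
    import org.apache.solr.common.cloud.DocCollection;
    import org.apache.solr.common.cloud.Replica;
    import org.apache.solr.common.cloud.Slice;

    public class LeaderCheck {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr")) {
          client.connect();
          ClusterState state = client.getZkStateReader().getClusterState();
          Set<String> liveNodes = state.getLiveNodes();
          DocCollection coll = state.getCollection("mycollection");
          for (Slice slice : coll.getSlices()) {
            // A usable leader must exist, be ACTIVE, and sit on a live node.
            Replica leader = slice.getLeader();
            boolean leaderUp = leader != null
                && leader.getState() == Replica.State.ACTIVE
                && liveNodes.contains(leader.getNodeName());
            for (Replica replica : slice.getReplicas()) {
              boolean replicaActive = replica.getState() == Replica.State.ACTIVE
                  && liveNodes.contains(replica.getNodeName());
              // Flag the inconsistency: an active replica on a leaderless shard.
              if (!leaderUp && replicaActive) {
                System.out.println("shard " + slice.getName() + ": replica "
                    + replica.getName() + " is active but the shard has no live leader");
              }
            }
          }
        }
      }
    }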

There are also many open issues (some resolved in version 5.5) related to SolrCloud,
ZooKeeper and replication.

Here is a (non-exhaustive) list I could gather from JIRA:

- SOLR-8129: HdfsChaosMonkeyNothingIsSafeTest failures
  <https://issues.apache.org/jira/browse/SOLR-8129>
- SOLR-8461: CloudSolrStream and ParallelStream can choose replicas that are not active
  <https://issues.apache.org/jira/browse/SOLR-8461>
- SOLR-8619: A new replica should not become leader when all current replicas are down as
  it leads to data loss
  <https://issues.apache.org/jira/browse/SOLR-8619>
- SOLR-3274: ZooKeeper related SolrCloud problems
  <https://issues.apache.org/jira/browse/SOLR-3274>
- SOLR-6406: ConcurrentUpdateSolrServer hang in blockUntilFinished
  <https://issues.apache.org/jira/browse/SOLR-6406>
- SOLR-8173: CLONE - Leader recovery process can select the wrong leader if all replicas
  for a shard are down and trying to recover as well as lose updates that should have been
  recovered
  <https://issues.apache.org/jira/browse/SOLR-8173>
- SOLR-8371: Try and prevent too many recovery requests from stacking up and clean up some
  faulty logic
  <https://issues.apache.org/jira/browse/SOLR-8371>
- SOLR-7121: Solr nodes should go down based on configurable thresholds and not rely on
  resource exhaustion
  <https://issues.apache.org/jira/browse/SOLR-7121>
- SOLR-8586: Implement hash over all documents to check for shard synchronization
  <https://issues.apache.org/jira/browse/SOLR-8586>

I wonder whether all these issues could be addressed by a general refactoring of this code
rather than by individual patches for each issue.
I know these issues are not easy to reproduce and debug, and I am not aware of all the
implications of this kind of work.
We are willing to contribute on these issues, although our knowledge of Solr internals may
still be too weak for such an important part of the SolrCloud architecture.
We can provide logs and benchmarks that lead to inconsistencies and/or bad cluster states.
The cluster appears to behave better with a 5-node ZooKeeper ensemble than with a 3-node
one. However, there is no sign of any problem on the ZooKeeper side when these errors occur
in Solr.

Regards,
Stephan


