lucene-dev mailing list archives

From "Shalin Shekhar Mangar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-6640) ChaosMonkeySafeLeaderTest failure with CorruptIndexException
Date Mon, 29 Dec 2014 14:42:13 GMT

    [ https://issues.apache.org/jira/browse/SOLR-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260134#comment-14260134
] 

Shalin Shekhar Mangar commented on SOLR-6640:
---------------------------------------------

I had a discussion with Varun about this issue. We have two problems here:
# Solr corrupts the index during replication recovery.
# Such a corrupt index puts Solr into an infinite recovery loop.

For #1 the problem is clear -- open searchers hold references to uncommitted/flushed files,
which get mixed with files downloaded from the leader, causing corruption.

Possible solutions for #1 are either a) switch to a different index dir, move/copy files
from committed segments, and use the index.properties approach to open a searcher on the new
index dir, or b) close the searcher, roll back the writer, and then download the necessary
files.
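The index.properties mechanism in option (a) amounts to recording which directory under the
core's data dir is the active index, and repointing it at a freshly built directory instead
of patching files into the live one. A minimal sketch of that bookkeeping, assuming the file
holds a single {{index}} property naming the active directory (method names here are
illustrative, not actual Solr APIs):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/**
 * Sketch of the index.properties approach: build the replicated index in a
 * fresh directory, then atomically repoint the core at it by rewriting a
 * small properties file, leaving the possibly-corrupt old dir untouched.
 * writeIndexProperties/readActiveIndexDir are hypothetical helpers.
 */
public class IndexDirSwitch {

    /** Record the new active index directory name, e.g. "index.20141229144213". */
    static void writeIndexProperties(Path dataDir, String newIndexDirName) throws IOException {
        Properties props = new Properties();
        props.setProperty("index", newIndexDirName);
        try (Writer w = Files.newBufferedWriter(dataDir.resolve("index.properties"))) {
            props.store(w, "active index directory");
        }
    }

    /** Resolve the directory a core should open; falls back to the default "index". */
    static String readActiveIndexDir(Path dataDir) throws IOException {
        Path f = dataDir.resolve("index.properties");
        if (!Files.exists(f)) {
            return "index";
        }
        Properties props = new Properties();
        try (Reader r = Files.newBufferedReader(f)) {
            props.load(r);
        }
        return props.getProperty("index", "index");
    }
}
```

The attraction of this route is that the switch is a single small-file rewrite, so a crash
mid-recovery leaves the core pointing at whichever directory was last fully written.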

Closing the searcher is not as simple as it sounds because the searcher is ref-counted
and close() doesn't actually close it immediately. Also, at any time, a request might open a new
searcher, so it is a very involved change.
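To illustrate why close() is deferred: a minimal ref-counting sketch (this is not Solr's
actual RefCounted/SolrIndexSearcher code, just the shape of the problem). The owner's
close() only drops one reference; the underlying resource is released only when the last
in-flight request releases its reference too.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Minimal sketch of a ref-counted searcher: close() is just a decref, so the
 * real close happens only after every in-flight request has finished.
 */
public class RefCountedSearcher {
    private final AtomicInteger refCount = new AtomicInteger(1); // creator holds one ref
    private volatile boolean reallyClosed = false;

    /** A request takes a reference before searching. */
    public void incref() {
        refCount.incrementAndGet();
    }

    /** Dropping the last reference performs the real close. */
    public void decref() {
        if (refCount.decrementAndGet() == 0) {
            reallyClosed = true; // release index files, caches, etc.
        }
    }

    /** close() only gives up the owner's reference. */
    public void close() {
        decref();
    }

    public boolean isReallyClosed() {
        return reallyClosed;
    }
}
```

So "close the searcher, then roll back the writer" has to wait out (or fence off) every
outstanding request, and also prevent new searchers from being opened in the meantime.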

For #2, everywhere we open a reader/searcher or writer, we should be ready to handle
corrupt index exceptions.
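The shape of that handling might look like the sketch below: treat a corruption error at
open time as a signal to escalate to full recovery rather than retrying forever.
IndexCorruptedException is a stand-in for Lucene's
org.apache.lucene.index.CorruptIndexException, and openReader/startFullRecovery are
hypothetical hooks, not actual Solr methods.

```java
/**
 * Sketch for problem #2: wherever a reader/writer is opened, catch the
 * corruption error and trigger a full recovery instead of looping.
 */
public class GuardedOpen {

    /** Stand-in for Lucene's CorruptIndexException. */
    static class IndexCorruptedException extends RuntimeException {}

    interface IndexOps {
        void openReader() throws IndexCorruptedException;
        void startFullRecovery(); // e.g. discard the index dir and replicate from the leader
    }

    /** Returns true if the reader opened cleanly, false if recovery was triggered. */
    static boolean openOrRecover(IndexOps ops) {
        try {
            ops.openReader();
            return true;
        } catch (IndexCorruptedException e) {
            // Do NOT retry the same open in a loop: escalate to a recovery
            // that replaces the corrupt index wholesale.
            ops.startFullRecovery();
            return false;
        }
    }
}
```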

I think we should first solve the problem of corrupting the index. So let's try
the deletion approach that Varun outlined. If that fails, then we should switch to a new index
dir, move/copy over files from commit points, fetch the missing segments from the leader, and
use the index.properties approach to completely move to a new index directory.

The second problem that we need to solve is that a corrupted index trashes the server. We
should be able to recover from such a scenario instead of going into an infinite recovery
loop.

Let's fix these two problems (in that order) and then figure out ways to optimize recovery.

Longer term, we need to change our code so that we can close the searchers, roll back the
writer, delete uncommitted files, and then attempt replication recovery.

Also, my earlier comment on non-cloud Solr was wrong:
bq. In SolrCloud we could just close the searcher before rollback because a replica in recovery
won't get any search requests but that's not practical in standalone Solr because it'd cause
downtime.

In standalone Solr this is not a problem because indexing and soft commits do not happen
on slaves. But in any case, changing the code to close the searcher etc. is a big change.

> ChaosMonkeySafeLeaderTest failure with CorruptIndexException
> ------------------------------------------------------------
>
>                 Key: SOLR-6640
>                 URL: https://issues.apache.org/jira/browse/SOLR-6640
>             Project: Solr
>          Issue Type: Bug
>          Components: replication (java)
>    Affects Versions: 5.0
>            Reporter: Shalin Shekhar Mangar
>             Fix For: 5.0
>
>         Attachments: Lucene-Solr-5.x-Linux-64bit-jdk1.8.0_20-Build-11333.txt, SOLR-6640.patch,
SOLR-6640.patch
>
>
> Test failure found on jenkins:
> http://jenkins.thetaphi.de/job/Lucene-Solr-5.x-Linux/11333/
> {code}
> 1 tests failed.
> REGRESSION:  org.apache.solr.cloud.ChaosMonkeySafeLeaderTest.testDistribSearch
> Error Message:
> shard2 is not consistent.  Got 62 from http://127.0.0.1:57436/collection1lastClient and
got 24 from http://127.0.0.1:53065/collection1
> Stack Trace:
> java.lang.AssertionError: shard2 is not consistent.  Got 62 from http://127.0.0.1:57436/collection1lastClient
and got 24 from http://127.0.0.1:53065/collection1
>         at __randomizedtesting.SeedInfo.seed([F4B371D421E391CD:7555FFCC56BCF1F1]:0)
>         at org.junit.Assert.fail(Assert.java:93)
>         at org.apache.solr.cloud.AbstractFullDistribZkTestBase.checkShardConsistency(AbstractFullDistribZkTestBase.java:1255)
>         at org.apache.solr.cloud.AbstractFullDistribZkTestBase.checkShardConsistency(AbstractFullDistribZkTestBase.java:1234)
>         at org.apache.solr.cloud.ChaosMonkeySafeLeaderTest.doTest(ChaosMonkeySafeLeaderTest.java:162)
>         at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:869)
> {code}
> Cause of inconsistency is:
> {code}
> Caused by: org.apache.lucene.index.CorruptIndexException: file mismatch, expected segment
id=yhq3vokoe1den2av9jbd3yp8, got=yhq3vokoe1den2av9jbd3yp7 (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/mnt/ssd/jenkins/workspace/Lucene-Solr-5.x-Linux/solr/build/solr-core/test/J0/temp/solr.cloud.ChaosMonkeySafeLeaderTest-F4B371D421E391CD-001/tempDir-001/jetty3/index/_1_2.liv")))
>    [junit4]   2> 		at org.apache.lucene.codecs.CodecUtil.checkSegmentHeader(CodecUtil.java:259)
>    [junit4]   2> 		at org.apache.lucene.codecs.lucene50.Lucene50LiveDocsFormat.readLiveDocs(Lucene50LiveDocsFormat.java:88)
>    [junit4]   2> 		at org.apache.lucene.codecs.asserting.AssertingLiveDocsFormat.readLiveDocs(AssertingLiveDocsFormat.java:64)
>    [junit4]   2> 		at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:102)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

