lucene-dev mailing list archives

From "Mike Drob (JIRA)" <>
Subject [jira] [Commented] (SOLR-9836) Add more graceful recovery steps when failing to create SolrCore
Date Wed, 07 Dec 2016 23:17:58 GMT


Mike Drob commented on SOLR-9836:

bq. I'm not sure that is the right exception to catch - very brittle. We should probably be
mostly looking for CorruptIndexException and if that doesn't cover a case at the Lucene
level, look at improving that there. Even if the case of a 0 byte segments file with nothing
to roll back on throws an EOFException today, it may not tomorrow. I think that is the goal
of the CorruptIndexException - you can actually have a little more than momentary confidence
that your code is not treating exceptions one way while things change underneath you over
time.

I could add a check somewhere along the chain that would turn an {{EOFException}} into a
{{CorruptIndexException}}. However, I'm not confident enough in the Lucene internals to know
whether this would eventually lead to false positives somewhere. It would probably look like:
     long generation = generationFromSegmentsFileName(segmentFileName);
     //System.out.println(Thread.currentThread() + ": SegmentInfos.readCommit " + segmentFileName);
+    ChecksumIndexInput saved = null;
     try (ChecksumIndexInput input = directory.openChecksumInput(segmentFileName, IOContext.READ)) {
+      saved = input;
       return readCommit(directory, input, generation);
+    } catch (EOFException e) {
+      throw new CorruptIndexException("Unexpected end of file while reading index.", saved, e);
+    }

But the method javadoc worries me: {{* Read a particular segmentFileName.  Note that this
may throw an IOException if a commit is in process.}}
Under what circumstances would this throw an IOException? Randomly returning CorruptIndex
during normal operation is bad news.
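
For readers who want to try the EOF-to-corruption translation outside of Lucene, here is a minimal, self-contained sketch of the same pattern using only the JDK. Everything in it is an assumption made for illustration: {{CorruptDataException}} is a hypothetical stand-in for Lucene's {{CorruptIndexException}}, and {{readHeader}} is an invented helper, not part of {{SegmentInfos}}. It does not address the open question above about reads that race with an in-progress commit.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class EofTranslationSketch {

  /** Hypothetical stand-in for Lucene's CorruptIndexException; not the real class. */
  static class CorruptDataException extends IOException {
    CorruptDataException(String message, String resource, Throwable cause) {
      super(message + " (resource=" + resource + ")", cause);
    }
  }

  /** Reads a 4-byte header and reports a truncated file as corruption instead of a bare EOF. */
  static int readHeader(Path file) throws IOException {
    try (InputStream raw = Files.newInputStream(file);
         DataInputStream in = new DataInputStream(raw)) {
      return in.readInt();
    } catch (EOFException e) {
      // Keep the original EOFException as the cause so no diagnostic detail is lost.
      throw new CorruptDataException(
          "Unexpected end of file while reading header", file.toString(), e);
    }
  }

  public static void main(String[] args) throws IOException {
    // A freshly created temp file is zero bytes long, like the empty segments_n case.
    Path empty = Files.createTempFile("segments_", ".tmp");
    try {
      readHeader(empty);
    } catch (CorruptDataException e) {
      System.out.println("Treated as corruption: " + e.getMessage());
    } finally {
      Files.delete(empty);
    }
  }
}
```

Callers can then catch the single corruption type rather than depending on whichever low-level exception a truncated file happens to produce today.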

> Add more graceful recovery steps when failing to create SolrCore
> ----------------------------------------------------------------
>                 Key: SOLR-9836
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Mike Drob
>         Attachments: SOLR-9836.patch
> I have seen several cases where there is a zero-length segments_n file. We haven't identified
> the root cause of these issues (possibly a poorly timed crash during replication?) but if
> there is another node available then Solr should be able to recover from this situation. Currently,
> we log and give up on loading that core, leaving the user to manually intervene.
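
As a rough illustration of the symptom described in the issue, the following sketch scans a directory for zero-length segments_n files using only the JDK. It is not taken from the attached patch. The class and method names are hypothetical, and the {{segments_*}} glob is an assumption about how such files are named on disk.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class EmptySegmentsFileCheck {

  /** Returns every regular file matching segments_* in indexDir whose size is zero bytes. */
  static List<Path> findEmptySegmentsFiles(Path indexDir) throws IOException {
    List<Path> empty = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(indexDir, "segments_*")) {
      for (Path candidate : stream) {
        if (Files.isRegularFile(candidate) && Files.size(candidate) == 0L) {
          empty.add(candidate);
        }
      }
    }
    return empty;
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("usage: EmptySegmentsFileCheck <index-directory>");
      return;
    }
    for (Path p : findEmptySegmentsFiles(Paths.get(args[0]))) {
      System.out.println("zero-length segments file: " + p);
    }
  }
}
```

A check like this only identifies the condition. Deciding whether to recover from another replica is a separate step that the sketch does not attempt.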

This message was sent by Atlassian JIRA
