lucene-dev mailing list archives

From "Yonik Seeley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-7141) RecoveryStrategy: Raise time that we wait for any updates from the leader before they saw the recovery state to have finished.
Date Sat, 18 Apr 2015 23:08:59 GMT

    [ https://issues.apache.org/jira/browse/SOLR-7141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501622#comment-14501622
] 

Yonik Seeley commented on SOLR-7141:
------------------------------------

It's tricky ;-)

From memory, here's how it's supposed to work:
1. replica tells leader it wants to recover
2. leader starts forwarding updates to replica (which the replica buffers since it's in recovery)
3. leader executes a hard commit (so replica can replicate the current index)
4. replica starts replicating index from the last leader commit point

Note that the ordering of #2 and #3 is very important.  If we did #3 first and #2 afterwards,
some updates would miss the commit and also would not be forwarded to the replica, and
that leads to data loss.

Now the issue: even though we do #2 first and #3 after, it's possible for an unluckily
scheduled update in a different thread to start before we do #2 and not complete
until after #3.  That update is not forwarded, and it's also not in the replicated index.
The sleep (which should be between steps #2 and #3) is there to try to give such an update
time to complete and make it into the index.
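
To make the window concrete, here is a small deterministic toy model of the
steps above.  It is only a sketch, not Solr code: the Leader class, its fields,
and missing_after_recovery() are made-up names, and it assumes an update thread
samples the "forward to replica" state once, when the update starts.

```python
class Leader:
    """Toy leader; all names here are hypothetical, not Solr classes."""

    def __init__(self):
        self.forwarding = False      # step 2 flips this to True
        self.index = []              # docs the leader has indexed
        self.commit_point = []       # snapshot taken by the hard commit
        self.replica_buffer = []     # updates forwarded to the replica

    def begin_update(self, doc):
        # Assumption: the update samples the forwarding state when it starts.
        return (doc, self.forwarding)

    def finish_update(self, pending):
        doc, saw_forwarding = pending
        self.index.append(doc)
        if saw_forwarding:
            self.replica_buffer.append(doc)

    def start_forwarding(self):      # step 2
        self.forwarding = True

    def hard_commit(self):           # step 3
        self.commit_point = list(self.index)


def missing_after_recovery(wait_for_in_flight):
    """Return the docs the leader has but the recovered replica lacks."""
    leader = Leader()
    in_flight = leader.begin_update("doc-1")   # starts before step 2
    leader.start_forwarding()                  # step 2
    if wait_for_in_flight:
        # Stands in for the sleep: the in-flight update finishes before #3.
        leader.finish_update(in_flight)
    leader.hard_commit()                       # step 3
    if not wait_for_in_flight:
        # The unlucky schedule: the update completes only after the commit.
        leader.finish_update(in_flight)
    # Step 4: replica replicates the commit point, then replays its buffer.
    replica_docs = set(leader.commit_point) | set(leader.replica_buffer)
    return set(leader.index) - replica_docs
```

In this model, missing_after_recovery(False) loses doc-1 and
missing_after_recovery(True) loses nothing.  A real sleep only makes the bad
schedule less likely; it does not rule it out.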

It occurs to me that the Lucene IndexWriter thread stealing (the same issue behind
SOLR-6820) could make this much more likely than we would have thought.

One possible alternative is to block updates for a commit of this type (replication commit).
Any blocked updates would, once unblocked, need to see that they must be forwarded to the
replica too.  I don't know if the code is currently written that way.
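
A rough sketch of that alternative, again as a toy model rather than Solr code
(BlockingLeader, replication_commit() and missing_on_replica() are made-up
names).  It assumes a single mutex for simplicity; a real implementation would
presumably want something like a read-write lock so that ordinary updates do
not serialize against each other.  The key point is that an update reads the
forwarding state only after it has been unblocked.

```python
import threading


class BlockingLeader:
    """Toy leader that blocks updates during a replication commit."""

    def __init__(self):
        self._lock = threading.Lock()
        self.forwarding = False
        self.index = []
        self.commit_point = []
        self.replica_buffer = []

    def update(self, doc):
        with self._lock:
            # Read the flag only once unblocked, so an update that waited
            # behind the replication commit still gets forwarded.
            forward = self.forwarding
            self.index.append(doc)
            if forward:
                self.replica_buffer.append(doc)

    def replication_commit(self):
        # Waits for in-flight updates, then does #2 and #3 with updates blocked.
        with self._lock:
            self.forwarding = True
            self.commit_point = list(self.index)


def missing_on_replica(leader):
    """Docs on the leader that are in neither the commit nor the buffer."""
    replica_docs = set(leader.commit_point) | set(leader.replica_buffer)
    return set(leader.index) - replica_docs


if __name__ == "__main__":
    leader = BlockingLeader()
    writers = [
        threading.Thread(target=leader.update, args=(f"doc-{i}",))
        for i in range(50)
    ]
    committer = threading.Thread(target=leader.replication_commit)
    for t in writers[:25]:
        t.start()
    committer.start()
    for t in writers[25:]:
        t.start()
    for t in writers + [committer]:
        t.join()
    print(missing_on_replica(leader))
```

Under these assumptions every update lands either in the commit point or in
the replica's buffer, whatever the thread schedule, at the cost of stalling
updates for the duration of the commit.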

> RecoveryStrategy: Raise time that we wait for any updates from the leader before they
> saw the recovery state to have finished.
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-7141
>                 URL: https://issues.apache.org/jira/browse/SOLR-7141
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>             Fix For: Trunk, 5.1
>
>         Attachments: SOLR-7141.patch
>
>
> The current wait of 3 seconds is pushing the envelope a bit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

