lucene-dev mailing list archives

From Tomás Fernández Löbbe (JIRA) <>
Subject [jira] [Commented] (SOLR-10751) Master/Slave IndexVersion conflict
Date Fri, 22 Feb 2019 05:32:00 GMT


Tomás Fernández Löbbe commented on SOLR-10751:

I created a PR with #2; it is still WIP. In the PR, I only handle the version 0 case differently
for PULL replicas; however, [~caomanhdat] did something related for TLOG replicas. For TLOG
replicas there is no commit; instead, the replica opens a new searcher and updates the commit
point in the {{IndexFetcher}}. I'm guessing this is so that TLOG replicas show 0 results
for searches and so that, if one becomes the leader, the followers replicate the empty
index from it. I'm wondering whether for TLOG replicas we would actually want the same behavior
as for PULL replicas, with no replication happening in the version 0 case.
 [~caomanhdat], [~shalinmangar], your input would be great.
 As for testing, both {{TestPullReplica}} and {{TestTlogReplica}} are disabled with {{@AwaitsFix}}
at this point. I enabled {{TestPullReplica}} and it's in good shape. {{TestTlogReplica}} did
have many failures, which I'm going to take a look at. {{ChaosMonkeyNothingIsSafeWithPullReplicasTest}}
is also looking better (1 failure after 1k runs, and it's an object leak that seems related
to this {{openNewSearcherAndUpdateCommitPoint}} code, actually).

> Master/Slave IndexVersion conflict
> ----------------------------------
>                 Key: SOLR-10751
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 7.0
>            Reporter: Tomás Fernández Löbbe
>            Assignee: Tomás Fernández Löbbe
>            Priority: Major
>         Attachments: SOLR-10751.patch
>          Time Spent: 10m
>  Remaining Estimate: 0h
> I’ve been looking at some failures in the replica types tests. One strange failure
I noticed is that master and slave share the same version but have different generations. The
IndexFetcher code does more or less this:
> {code}
> masterVersion = fetchMasterVersion()
> masterGeneration = fetchMasterGeneration()
> if (masterVersion == 0 && slaveGeneration != 0 && forceReplication) {
>    delete my index
>    commit locally
>    return
> } 
> if (masterVersion != slaveVersion) {
>   fetchIndexFromMaster(masterGeneration)
> } else {
>   //do nothing, master and slave are in sync.
> }
> {code}
> The problem I see happens with this sequence of events:
> 1. Delete the index in master (not a DBQ=\*:\*, I mean a complete removal of the index files
and reload of the core).
> 2. Replication happens in slave (it sees a version 0, deletes the local index and commits).
> 3. Add a document in master and commit.
> 4. If the commits in master and in the slave happen at the same millisecond*, they both end
up with the same version, but different indices.
> I think that in addition to checking for the same version, we should validate that slave
and master have the same generation and, if not, consider them not in sync and proceed with
the replication.
> True, this situation is unlikely to happen in a real prod environment and is
more likely to affect tests, but I think the change makes sense.
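
The proposed check can be sketched as follows. This is a minimal illustration, not Solr's actual {{IndexFetcher}} code: the {{CommitPoint}} and {{SyncCheckSketch}} classes and their method names are hypothetical, the timestamp and generation numbers are made up, and it assumes (as the description implies) that the version is derived from the commit time in milliseconds.

```java
// Hypothetical sketch; these classes are NOT part of Solr.
final class CommitPoint {
    final long version;     // assumed: commit time in milliseconds
    final long generation;  // commit generation of the index

    CommitPoint(long version, long generation) {
        this.version = version;
        this.generation = generation;
    }
}

final class SyncCheckSketch {
    // Behaviour as described in the issue: only the versions are compared.
    static boolean inSyncByVersionOnly(CommitPoint master, CommitPoint slave) {
        return master.version == slave.version;
    }

    // Proposed behaviour: both version and generation must match.
    static boolean inSyncByVersionAndGeneration(CommitPoint master, CommitPoint slave) {
        return master.version == slave.version
            && master.generation == slave.generation;
    }

    public static void main(String[] args) {
        long sameMillis = 1_550_813_520_000L; // arbitrary illustrative timestamp

        // Master: index files removed, core reloaded, one document committed.
        CommitPoint master = new CommitPoint(sameMillis, 1);
        // Slave: older index emptied and committed in the same millisecond.
        CommitPoint slave = new CommitPoint(sameMillis, 5);

        System.out.println("version only:         in sync = "
            + inSyncByVersionOnly(master, slave));
        System.out.println("version + generation: in sync = "
            + inSyncByVersionAndGeneration(master, slave));
    }
}
```

With equal versions and different generations, the version-only check reports the two indices as in sync and skips replication, while the combined check reports them as out of sync and would trigger a fetch.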

This message was sent by Atlassian JIRA
