lucene-dev mailing list archives

From "Cao Manh Dat (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-9835) Create another replication mode for SolrCloud
Date Thu, 23 Feb 2017 02:09:44 GMT

    [ https://issues.apache.org/jira/browse/SOLR-9835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879721#comment-15879721 ]

Cao Manh Dat commented on SOLR-9835:
------------------------------------

Thanks [~shalinmangar]!
bq. LeaderInitiatedRecoveryThread – What is the reason behind adding SocketTimeoutException
in the list of communication errors on which no more retries are made?
This change comes from a Jepsen test. The bug also affects the current mode, so I created a separate
issue for it, SOLR-9913. We can skip this change for this ticket.
bq.ZkController.register method – The condition for !isLeader && onlyLeaderIndexes
can be replaced by the isReplicaInOnlyLeaderIndexes variable.
Yeah, that's right.
bq. Since there is no log replay on startup on replicas anymore, what if the replica is killed
(which keeps its state as 'active' in ZK) and then the cluster is restarted and the replica
becomes leader candidate? If we do not replay the discarded log then it could lead to data
loss?
Very good catch. I will try to resolve this problem.
bq. UpdateLog – Can you please add javadocs outlining the motivation/purpose of the new
methods such as copyOverBufferingUpdates and switchToNewTlog e.g. why does switchToNewTlog
require copying over some updates from the old tlog?
Sure!
bq.It seems that any commits that might be triggered explicitly by the user can interfere
with the index replication. Suppose that a replication is in progress and a user explicitly
calls commit which is distributed to all replicas, in such a case the tlogs will be rolled
over and then when the ReplicateFromLeader calls switchToNewTlog(), the previous tlog may
not have all the updates that should have been copied over. We should have a way to either
disable explicit commits or protect against them on the replicas.
I don't think so. switchToNewTlog() is based on the commit version at the Lucene index level ({{commit.getUserData().get(SolrIndexWriter.COMMIT_COMMAND_VERSION)}}),
so we will always roll over the updates correctly.
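
To illustrate the reply above, here is a minimal sketch of a version-based rollover. It assumes that every update in the old tlog carries a version, that the commit version is stored as a string in the commit user data, and that deletes may carry negative versions (hence the comparison on magnitude). The class {{TlogRollover}}, its methods, and the {{key}} parameter are hypothetical and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; names are illustrative and not part of the SOLR-9835 patch.
class TlogRollover {

    /** Reads the commit version from the commit user data; returns 0 if it is absent. */
    static long commitVersion(Map<String, String> userData, String key) {
        String value = userData.get(key);
        return value == null ? 0L : Long.parseLong(value);
    }

    /** Returns, in order, the update versions that the commit does not cover yet. */
    static List<Long> updatesToCopy(List<Long> oldTlogVersions, long commitVersion) {
        List<Long> toCopy = new ArrayList<>();
        for (long version : oldTlogVersions) {
            // Assumption: deletes use negative versions, so compare magnitudes.
            if (Math.abs(version) > commitVersion) {
                toCopy.add(version);
            }
        }
        return toCopy;
    }
}
```

Because the filter depends only on the version recorded with the commit that was replicated, an explicit user commit that rolls the tlog early would not change which updates are selected in this sketch.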
bq.UpdateLog – why does copyOverBufferUpdates block updates while calling switchToNewTlog
but ReplicateFromLeader doesn't? How are they both safe?
Good catch. I think we should block updates in switchToNewTlog as well.
bq.Can we add tests for testing CDCR and backup/restore with this new replication scheme?
CDCR is very complex; I don't think we should support CDCR in this new replication mode for now.
bq. ZkController.startReplicationFromLeader – Using a ConcurrentHashMap is not enough to
prevent two simultaneous replications from happening concurrently. You should use the atomic
putIfAbsent to put a core to the map before starting replication.
Yeah, that sounds like a good idea.
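
The suggestion above can be sketched as follows. This is only an illustration of the atomic {{putIfAbsent}} pattern: the class {{ReplicationGuard}} and its methods are hypothetical and do not reflect the actual ZkController code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical guard; class and method names are illustrative, not Solr's.
class ReplicationGuard {
    // Maps a core name to a marker for the replication currently running on it.
    private final ConcurrentMap<String, Boolean> running = new ConcurrentHashMap<>();

    /** Returns true only for the single caller that wins the right to start replication. */
    boolean tryStart(String coreName) {
        // putIfAbsent is atomic: it returns null only for the first caller.
        return running.putIfAbsent(coreName, Boolean.TRUE) == null;
    }

    /** Releases the slot so that a later replication can start. */
    void stop(String coreName) {
        running.remove(coreName);
    }
}
```

A separate {{containsKey}} check followed by {{put}} would leave a window in which two threads both see the core as absent; {{putIfAbsent}} closes that window because the check and the insert happen as one operation.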
bq.Aren't some of the guarantees of real-time-get relaxed in this new mode, especially
around delete-by-queries which no longer apply on replicas? Can you please document them as
a comment on the issue that we can transfer to the ref guide in future?
I will update the ticket description now. Basically, RTG is not consistent for DBQs.

> Create another replication mode for SolrCloud
> ---------------------------------------------
>
>                 Key: SOLR-9835
>                 URL: https://issues.apache.org/jira/browse/SOLR-9835
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Cao Manh Dat
>            Assignee: Shalin Shekhar Mangar
>         Attachments: SOLR-9835.patch, SOLR-9835.patch, SOLR-9835.patch, SOLR-9835.patch,
SOLR-9835.patch, SOLR-9835.patch, SOLR-9835.patch, SOLR-9835.patch, SOLR-9835.patch, SOLR-9835.patch,
SOLR-9835.patch
>
>
> The current replication mechanism of SolrCloud is called state machine replication: replicas start in the same initial state, and each input is distributed across the replicas so that all replicas end up in the same next state.
> But this type of replication has some drawbacks:
> - The commit (which is costly) has to run on all replicas.
> - Slow recovery: if a replica misses more than N updates during its downtime, it has to download the entire index from its leader.
> So we create another replication mode for SolrCloud called state transfer, which acts like master/slave replication. Basically:
> - The leader distributes each update to the other replicas, but only the leader applies the update to the IW; the other replicas just store the update in the UpdateLog (acting like replication).
> - Replicas frequently poll the leader for the latest segments.
> Pros:
> - Lightweight for indexing, because only the leader runs commits and applies updates.
> - Very fast recovery: replicas just have to download the missing segments.
> From a CAP point of view, this ticket tries to promise end users a distributed system with:
> - Partition tolerance
> - Weak consistency for normal queries: the cluster can serve stale data. This happens when the leader has finished a commit and a slave is still fetching the latest segment. This period can last at most {{pollInterval + time to fetch latest segment}}.
> - Consistency for RTG: just like the original SolrCloud mode
> - Weak availability: just like the original SolrCloud mode. If a leader goes down, clients must wait until a new leader is elected.
> To use this new replication mode, a new collection must be created with an additional parameter, {{liveReplicas=1}}:
> {code}
> http://localhost:8983/solr/admin/collections?action=CREATE&name=newCollection&numShards=2&replicationFactor=1&liveReplicas=1
> {code}
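
For scripted setups, the request shown in the description could be assembled as in the sketch below. It only builds the URL string and does not send it. The {{liveReplicas}} parameter is the one proposed in this ticket; the class {{CreateCollectionUrl}} and its {{build}} method are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper that assembles the Collections API URL from the description.
class CreateCollectionUrl {

    static String build(String solrBase, String name, int numShards,
                        int replicationFactor, int liveReplicas) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("action", "CREATE");
        params.put("name", name);
        params.put("numShards", Integer.toString(numShards));
        params.put("replicationFactor", Integer.toString(replicationFactor));
        params.put("liveReplicas", Integer.toString(liveReplicas)); // proposed in SOLR-9835

        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(e.getKey() + "=" + encode(e.getValue()));
        }
        return solrBase + "/admin/collections?" + query;
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(ex);
        }
    }
}
```

Calling {{CreateCollectionUrl.build("http://localhost:8983/solr", "newCollection", 2, 1, 1)}} yields the same URL as in the description.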

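The state transfer flow in the description (the leader applies updates, replicas only log them and later poll for committed segments) can be illustrated with a toy model. This is not Solr code: strings stand in for documents, plain lists stand in for the index and the UpdateLog, and the classes {{ToyLeader}} and {{ToyReplica}} are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: strings stand in for documents, lists for the index and the tlog.
class ToyLeader {
    private final List<String> pending = new ArrayList<>();   // applied, not yet committed
    private final List<String> committed = new ArrayList<>(); // visible "segments"

    void update(String doc, List<ToyReplica> replicas) {
        pending.add(doc);                 // only the leader applies the update
        for (ToyReplica replica : replicas) {
            replica.log(doc);             // replicas just record it in their log
        }
    }

    void commit() {
        committed.addAll(pending);
        pending.clear();
    }

    /** Returns a copy of the committed state, as a replica would fetch it. */
    List<String> segments() {
        return new ArrayList<>(committed);
    }
}

class ToyReplica {
    final List<String> tlog = new ArrayList<>();
    List<String> index = new ArrayList<>();

    void log(String doc) {
        tlog.add(doc);
    }

    /** Called periodically: replaces the local index with the leader's committed state. */
    void poll(ToyLeader leader) {
        index = leader.segments();
    }
}
```

Between the leader's {{commit()}} and the replica's next {{poll()}}, the replica's index lags behind, which corresponds to the stale-read window described in the CAP discussion.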


--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

