lucene-dev mailing list archives

From "Timothy Potter (JIRA)" <>
Subject [jira] [Updated] (SOLR-5468) Option to enforce a majority quorum approach to accepting updates in SolrCloud
Date Thu, 01 May 2014 16:23:16 GMT


Timothy Potter updated SOLR-5468:

    Attachment: SOLR-5468.patch

Here is a patch that should be treated as exploratory / discovery work on this topic. It has
a little overlap with the patch I posted for SOLR-5495, but don't worry about that: I plan
to commit SOLR-5495 first, so the overlap will get worked out shortly.

As I dug into this idea in more detail, it became pretty clear that what we can accomplish
in the area of providing stronger enforcement of replication during update processing is fairly
limited by our architecture. Of course this is not a criticism of the current architecture
as I feel it’s fundamentally sound.

The underlying concept in this feature is that a client application wants to ensure a write
succeeds on more than just the leader. For instance, in a collection with RF=3, the client
may want to say: don't consider an update request successful unless it succeeds on 2 of the
3 replicas. Today, an update request is considered successful if it succeeds on the leader
alone. The problem is that there's no consistent way to "back out" an update without some
sort of distributed transaction coordination among the replicas, and I'm pretty sure we
don't want to go down that road. Backing out an update seems (potentially) doable if we're
only talking about the leader, but what happens when the client wants min_rf=3 and the
update only works on the leader and one of the replicas? Now we need to back out an update
from the leader and one of the replicas. Gets ugly fast ...

So what does this patch accomplish? First, a client application can request information
back from the cluster on what replication factor was achieved for an update request by
sending the min_rf parameter with the request. This is the hint to the leader to keep track
of the success or failure of the request on each replica. Since that implies some waiting
to see the results, the client can also send the max_rfwait parameter, which tells the
leader how long it should wait to collect the results from the replicas (default is 5
seconds). This is captured in the ReplicationFactorTest class.
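As a rough illustration of the leader-side bookkeeping described above, here is a minimal Python sketch (the function name and structure are hypothetical, not Solr code): the leader counts its own local write plus every replica ack that arrives within the wait window, and that count is what would be reported back to the client as the achieved replication factor.

```python
import concurrent.futures

def collect_achieved_rf(leader_ok, replica_calls, max_wait_secs=5.0):
    """Count how many copies of an update succeeded: the leader's local
    write plus every replica ack that arrives before the wait expires.
    replica_calls is a list of no-arg callables returning True on ack."""
    if not leader_ok:
        return 0  # no local write, so nothing was replicated
    achieved = 1  # the leader's own copy
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(call) for call in replica_calls]
        # mirrors the max_rfwait idea: stop collecting after the window
        done, _ = concurrent.futures.wait(futures, timeout=max_wait_secs)
        achieved += sum(1 for f in done if not f.exception() and f.result())
    return achieved

# Example: leader write succeeds, one replica acks, one fails -> rf of 2.
rf = collect_achieved_rf(True, [lambda: True, lambda: False])
```

The point of the sketch is that the leader can only *observe* and report the achieved factor; it cannot retroactively undo the writes that already landed.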

This can be useful for client applications that have idempotent updates and thus can decide
to retry the updates if the desired replication factor was not achieved. What we can't do is
fail the request if the desired min_rf is not achieved, as that leads to the aforementioned
back-out issues. There is one case where we can fail the request and avoid the back-out
issue: if we know min_rf can't be achieved before we do the write locally on the leader.
This patch doesn't have that solution in place, as I wasn't sure whether it's desired; if
so, it will be easy to add. Also, this patch doesn't have anything in place for batch
processing, i.e. it only works with single update requests, as I wanted to get some feedback
before going down that path any further. Moreover, there's a pretty high cost in terms of
slowing down update request processing in SolrCloud, because the leader blocks until it
knows the result of the request on the replicas. In other words, this functionality is not
free, but it may still be useful for some applications.
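The client-side retry pattern described above can be sketched as follows; `send_with_min_rf` and its callback are illustrative stand-ins, not the SolrJ API. The client re-sends an idempotent update until the reported replication factor reaches the requested min_rf or the retries run out.

```python
def send_with_min_rf(send_update, min_rf, max_retries=3):
    """Retry an idempotent update until the achieved replication factor
    reaches min_rf. send_update() stands in for issuing the request with
    the min_rf parameter and returns the rf the cluster reports back."""
    for _attempt in range(1 + max_retries):
        rf = send_update()
        if rf >= min_rf:
            return rf
    # retries exhausted; caller decides what an under-replicated write means
    return rf

# Example: the first attempt only reaches rf=1, the retry reaches rf=2.
results = iter([1, 2])
final_rf = send_with_min_rf(lambda: next(results), min_rf=2)
```

Because the update must be idempotent, re-sending a write that did land on some replicas is harmless, which is exactly why this pattern sidesteps the back-out problem.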

To me, at the end of the day, what's really needed is to ensure that any update requests
that were ack'd back to the client are not lost. This can happen under the current
architecture if the leader is the only replica that has a write, then fails and doesn't
recover before another replica resumes the leadership role (after waiting too long to see
the previous leader come back). Thus, from where I sit, our efforts are better spent on
continuing to harden the leader failover and recovery processes; applications needing
stronger guarantees should run more replicas. SolrCloud should then focus on only allowing
in-sync replicas to become the leader, using strategies like what was provided with SOLR-5495.

> Option to enforce a majority quorum approach to accepting updates in SolrCloud
> ------------------------------------------------------------------------------
>                 Key: SOLR-5468
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>          Components: SolrCloud
>    Affects Versions: 4.5
>         Environment: All
>            Reporter: Timothy Potter
>            Assignee: Timothy Potter
>            Priority: Minor
>         Attachments: SOLR-5468.patch
> I've been thinking about how SolrCloud deals with write-availability using in-sync replica
sets, in which writes will continue to be accepted so long as there is at least one healthy
node per shard.
> For a little background (and to verify my understanding of the process is correct), SolrCloud
only considers active/healthy replicas when acknowledging a write. Specifically, when a shard
leader accepts an update request, it forwards the request to all active/healthy replicas and
only considers the write successful if all active/healthy replicas ack the write. Any down
/ gone replicas are not considered and will sync up with the leader when they come back online
using peer sync or snapshot replication. For instance, if a shard has 3 nodes, A, B, C with
A being the current leader, then writes to the shard will continue to succeed even if B &
C are down.
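The write-acknowledgement semantics described in the paragraph above can be sketched as follows (function and variable names are illustrative, not Solr internals): only active/healthy replicas participate, and the write succeeds only if all of them ack, so a shard whose replicas are all down still accepts writes on the leader alone.

```python
def write_succeeds(replica_states, acks):
    """Current SolrCloud semantics as described above: consult only
    active/healthy replicas, and succeed only if every one of them acks.
    Down/gone replicas are ignored and catch up later via peer sync or
    snapshot replication."""
    active = [name for name, state in replica_states.items()
              if state == "active"]
    return all(acks.get(name, False) for name in active)

# Shard with A (leader), B, C; B and C are down: A's ack alone suffices.
states = {"A": "active", "B": "down", "C": "down"}
ok = write_succeeds(states, {"A": True})
```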
> The issue is that if a shard leader continues to accept updates even after it loses all
of its replicas, then we have acknowledged updates on only 1 node. If that node, call it A,
then fails, and one of the previous replicas, call it B, comes back online before A does,
then any writes that A accepted while the other replicas were offline are at risk of being
lost.

> SolrCloud does provide a safeguard mechanism for this problem with the leaderVoteWait
setting, which puts any replicas that come back online before node A into a temporary wait
state. If A comes back online within the wait period, then all is well, as it will become
the leader again and no writes will be lost. As a side note, sysadmins definitely need to
be made more aware of this situation; when I first encountered it in my cluster, I had no
idea what it meant.
> My question is whether we want to consider an approach where SolrCloud will not accept
writes unless a majority of replicas is available to accept the write. For my example,
under this approach, we wouldn't accept writes if both B & C failed, but would if only C
did, leaving A & B online. Admittedly, this lowers the write-availability of the system,
so it may be something that should be tunable?
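A minimal sketch of the majority-quorum gate proposed above (illustrative names, not actual Solr code): writes are refused unless more than half of the shard's replicas are up, which is exactly the B & C example in the question.

```python
def quorum_available(total_replicas, active_replicas):
    """Majority-quorum gate: accept writes only when a strict majority
    of the shard's replicas is currently active."""
    return active_replicas > total_replicas // 2

# RF=3 shard: writes allowed with A & B up (2 of 3),
# refused with only A up (1 of 3), matching the example above.
allowed = quorum_available(3, 2)
refused = quorum_available(3, 1)
```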
> From Mark M: Yeah, this is kind of like one of many little features that we have just
not gotten to yet. I've always planned for a param that lets you say how many replicas an
update must be verified on before responding success. It seems to make sense to fail that
type of request early if you notice there are not enough replicas up to satisfy the param
to begin with.
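The fail-early idea in Mark's comment could look roughly like this (hypothetical names; the real check would run before the leader's local write). Rejecting up front when min_rf is unreachable avoids the back-out problem entirely, because nothing has been written anywhere yet.

```python
def accept_update(active_replica_count, min_rf):
    """Fail-early check: reject the request before the leader writes
    locally if min_rf cannot possibly be achieved right now."""
    # best achievable rf = the leader's own copy + currently active replicas
    achievable = 1 + active_replica_count
    if achievable < min_rf:
        raise RuntimeError(
            "rejecting update: only rf=%d achievable, min_rf=%d requested"
            % (achievable, min_rf))
    return achievable

# With 2 active replicas a min_rf of 3 is still achievable (leader + 2).
best_case = accept_update(2, 3)
```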

This message was sent by Atlassian JIRA
