lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mike Drob (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-10525) Stacked recovery requests can interfere with one another
Date Thu, 20 Apr 2017 00:30:04 GMT

    [ https://issues.apache.org/jira/browse/SOLR-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15975807#comment-15975807
] 

Mike Drob commented on SOLR-10525:
----------------------------------

It occurs to me that with this change if there is a steady stream of recovery requests then
each one will start up and get cancelled. I don't have a good grasp on how heavy the start
cost here is, it looks like mostly checking the update log in RecoveryStrategy.

The previous behaviour would be to run a full recovery and then run another full recovery.
If failures were still happening at that point, then we could queue up another recovery. So
we're not impacting availability when the root cause hasn't been fixed.

It does get better if the root cause is more temporal, since the final recovery can start
as soon as the problem is fixed, instead of waiting for another recovery to finish.

I think I've convinced myself to not worry about the churn, but would like a second opinion.

> Stacked recovery requests can interfere with one another
> --------------------------------------------------------
>
>                 Key: SOLR-10525
>                 URL: https://issues.apache.org/jira/browse/SOLR-10525
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Mike Drob
>         Attachments: SOLR-10525.patch
>
>
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/update/DefaultSolrCoreState.java#L300-L310
> Two issues with this code:
> {code}
>           boolean locked = recoveryLock.tryLock();
>           try {
>             if (!locked) {
>               if (recoveryWaiting.get() > 0) { // line 1
>                 return;
>               }
>               recoveryWaiting.incrementAndGet(); // line 2
>             } else {
>               recoveryWaiting.incrementAndGet();
>               cancelRecovery(); // line 3
> }
> {code}
> The {{cancelRecovery}} on line 3 call will only hit when there are no recoveries to actually
cancel (since we got the lock that means there are no recoveries in progress). Instead it
should be moved either to the either branch of the if, or outside after the if since we know
we will be running a recovery at that point.
> This code doesn't always prevent multiple requests from stacking. If there is a recovery
running, but no recoveries currently waiting, multiple requests can check the count at line
1 before any of them will increment the count at line 2 and thus all of them will hit the
increment.
> I don't have specific tests for this, but it's causing failures for me on my SOLR-9555
work in progress.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message