lucene-dev mailing list archives

From "Mark Miller (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-10904) Unnecessary waiting during failover in case of failed core creation
Date Tue, 03 Oct 2017 15:36:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-10904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16189863#comment-16189863 ]

Mark Miller commented on SOLR-10904:
------------------------------------

[~mihaly.toth], do you have a patch for this one? I'd like to get any custom changes we have
to this externally back upstream so that I can get things back in shape and the nightly test
passing again. I'd feel a lot more comfortable flipping to the upcoming implementation with
the old one working properly again first.

> Unnecessary waiting during failover in case of failed core creation
> -------------------------------------------------------------------
>
>                 Key: SOLR-10904
>                 URL: https://issues.apache.org/jira/browse/SOLR-10904
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 7.0
>            Reporter: Mihaly Toth
>            Assignee: Mark Miller
>
> Background failover thread checks for bad replicas. In case one is found it tries to
> create it on another node. Then it waits for the new replica to show up in the cluster state.
> It waits even if the core creation (initiated by itself) fails.
> This situation does not occur on the happy path of the failover cases because the new
> node was marked as alive. But if the cluster is in an unstable state, or the user is restarting
> the new node, or the overseer is overloaded, this extra wait will hold up the failover
> thread.
> Proposed solution may be
> # wait for the result of the core creation
> # only if previous step is successful proceed to wait for cluster state change
> In code:
> {code}
> try {
>   Future<Boolean> future = updateExecutor.submit(() -> createSolrCore(collection,
>       createUrl, dataDir, ulogDir, coreNodeName, coreName, shardId));
>   future.get(30000L, TimeUnit.MILLISECONDS);
> } catch (InterruptedException | ExecutionException | TimeoutException e) {
>   log.error("Error creating core", e);
>   return false;
> } finally {
>   MDC.remove("OverseerAutoReplicaFailoverThread.createUrl");
> }
> {code}
> In that case we could consider moving core creation into the failover thread from the
> updateExecutor.
> I can post a patch with these changes if the solution seems appropriate.
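
The two proposed steps (wait for the core creation result, then wait for the cluster state only on success) can be sketched in isolation as below. This is a minimal illustration, not the Solr patch: the class `FailoverSketch`, the method `createThenAwait`, and the two `Callable` parameters are hypothetical stand-ins for `createSolrCore(...)` and the cluster state wait, which are not reproduced here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; none of these names come from the Solr code base.
final class FailoverSketch {

    /**
     * Runs createCore on the executor and waits up to timeoutMs for its result.
     * awaitClusterState is called only if core creation reported success.
     * Returns true only when both steps succeeded.
     */
    static boolean createThenAwait(ExecutorService executor,
                                   Callable<Boolean> createCore,
                                   Callable<Boolean> awaitClusterState,
                                   long timeoutMs) {
        Future<Boolean> future = executor.submit(createCore);
        try {
            // Step 1: wait for the result of the core creation.
            Boolean created = future.get(timeoutMs, TimeUnit.MILLISECONDS);
            if (!Boolean.TRUE.equals(created)) {
                return false; // creation reported failure, skip the second wait
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            future.cancel(true);
            return false;
        } catch (ExecutionException | TimeoutException e) {
            future.cancel(true); // creation failed or took too long
            return false;
        }
        try {
            // Step 2: only now wait for the cluster state change.
            return Boolean.TRUE.equals(awaitClusterState.call());
        } catch (Exception e) {
            return false;
        }
    }
}
```

Compared with the snippet in the description, the sketch also restores the thread's interrupt status and cancels the pending task on timeout, so a stuck creation attempt does not linger on the executor. Whether that is appropriate for the failover thread is an assumption.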



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

