lucene-dev mailing list archives

From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-12729) SplitShardCmd should lock the parent shard to prevent parallel splitting requests
Date Mon, 22 Oct 2018 10:40:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-12729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658865#comment-16658865 ]

ASF subversion and git services commented on SOLR-12729:
--------------------------------------------------------

Commit 90c1804131108091c06dc50ccc3e4ed72c2a854d in lucene-solr's branch refs/heads/branch_7x
from Andrzej Bialecki
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=90c1804 ]

SOLR-12729: Unlock the shard on error.


> SplitShardCmd should lock the parent shard to prevent parallel splitting requests
> ---------------------------------------------------------------------------------
>
>                 Key: SOLR-12729
>                 URL: https://issues.apache.org/jira/browse/SOLR-12729
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: AutoScaling
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>            Priority: Major
>             Fix For: 7.6, master (8.0)
>
>
> This scenario was discovered by the simulation framework, but it also exists in the non-simulated
code.
> When {{IndexSizeTrigger}} requests SPLITSHARD, which is then successfully started and
“completed” from the point of view of {{ExecutePlanAction}}, the reality is that it can still
take a significant amount of time until the new replicas fully recover and
cause the shard states to switch (parent to INACTIVE, child from RECOVERY to ACTIVE).
> If this time is longer than the trigger's {{waitFor}}, the trigger will issue the same
SPLITSHARD request again. {{SplitShardCmd}} doesn't prevent this new request from being processed
because the parent shard is still ACTIVE. However, a section of the code in {{SplitShardCmd}}
will detect that sub-slices with the target names already exist and are not active,
at which point it will delete those sub-slices ({{SplitShardCmd:182}}).
> The end result is an infinite loop, where {{IndexSizeTrigger}} will keep generating SPLITSHARD,
and {{SplitShardCmd}} will keep deleting the recovering sub-slices created by the previous
command.
> A simple solution is for the parent shard to be marked to indicate that it is in the process
of being split, so that no other split is attempted on the same shard. Furthermore, {{IndexSizeTrigger}}
could temporarily exclude such shards from monitoring.
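
The locking idea described above, together with the "unlock the shard on error" behaviour named in the commit message, can be sketched roughly as follows. This is a minimal in-memory illustration and not the code from the commit: the class ShardSplitLocks and all of its methods are hypothetical names, and it assumes a single JVM, whereas a real cluster would need the marker stored somewhere shared by all nodes.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: none of these names are Solr APIs.
final class ShardSplitLocks {
    // One "collection/shard" key per parent shard that has a split in progress.
    private final Set<String> splitting = ConcurrentHashMap.newKeySet();

    private static String key(String collection, String shard) {
        return collection + "/" + shard;
    }

    /** Atomically marks the shard as splitting; returns false if it is already marked. */
    boolean tryLock(String collection, String shard) {
        return splitting.add(key(collection, shard));
    }

    /** Clears the marker; harmless if the shard is not marked. */
    void unlock(String collection, String shard) {
        splitting.remove(key(collection, shard));
    }

    /** A trigger could call this to skip shards that are currently being split. */
    boolean isSplitting(String collection, String shard) {
        return splitting.contains(key(collection, shard));
    }

    /**
     * Starts a split only if no other split of the same shard is in progress.
     * On failure the marker is cleared so that a later request can retry.
     * On success the marker stays set, because sub-shard recovery continues
     * after the command returns; the caller is assumed to invoke unlock()
     * once the parent shard has gone INACTIVE.
     */
    boolean split(String collection, String shard, Runnable startSplit) {
        if (!tryLock(collection, shard)) {
            return false; // a split of this shard is already running
        }
        try {
            startSplit.run();
            return true;
        } catch (RuntimeException e) {
            unlock(collection, shard); // unlock the shard on error
            throw e;
        }
    }
}
```

With a guard of this kind, a repeated SPLITSHARD for the same parent is rejected instead of deleting the recovering sub-slices, and a failed split does not leave the parent shard permanently locked.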



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

