lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-5991) SolrCloud: Add API to move leader off a Solr instance
Date Thu, 17 Apr 2014 22:55:16 GMT

    [ https://issues.apache.org/jira/browse/SOLR-5991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973533#comment-13973533 ]

Hoss Man commented on SOLR-5991:
--------------------------------

Off the cuff: it sounds like what you'd really want for these types of use cases is:

1) an "AVOID_RESPONSIBILITY" role which tells a node it should never participate in elections
-- either for shard leader, or for overseer.
2) per-node status info (from /admin/system) about whether this node is the overseer (SOLR-5823)
and/or hosts the leader of any shard 
3) a "forceelection" Collection API action (that takes an optional collection name and shard
name - so it can force overseer election, or leader election of all shards, or leader election
of a specific shard)
4) logic in CoreContainer.shutdown() that causes the node to do the following before finishing
a clean shutdown:
* act as if it has the AVOID_RESPONSIBILITY role (w/o updating its actual ZK state) until
completion of shutdown
* loop over its current responsibilities and self-trigger the necessary "forceelection" commands
to elect someone else to take its place as overseer/shard-leader(s)
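
The four pieces above can be sketched as a toy in-memory model. This is only an illustration of the proposal, not Solr code: `Cluster`, `force_election`, `status`, `shutdown`, and the `avoiding` set are hypothetical names, the mapping of the optional arguments in (3) is one possible reading, and the "election" just picks the first eligible node in sorted order, whereas real elections are coordinated through ZooKeeper.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

AVOID = "AVOID_RESPONSIBILITY"  # role name proposed in this comment
ShardKey = Tuple[str, str]      # (collection, shard)


@dataclass
class Cluster:
    """Toy stand-in for cluster state; NOT a real Solr or ZooKeeper API."""
    live: Set[str]
    replicas: Dict[ShardKey, Set[str]]  # nodes hosting a replica of each shard
    roles: Dict[str, Set[str]] = field(default_factory=dict)
    leaders: Dict[ShardKey, Optional[str]] = field(default_factory=dict)
    overseer: Optional[str] = None
    avoiding: Set[str] = field(default_factory=set)  # transient, never persisted

    def eligible(self, node: str) -> bool:
        # (1) Nodes holding the role, or in the middle of shutting down, sit out.
        return (node in self.live
                and AVOID not in self.roles.get(node, set())
                and node not in self.avoiding)

    def _pick(self, candidates) -> Optional[str]:
        # Placeholder election: first eligible node in sorted order, else None.
        return next((n for n in sorted(candidates) if self.eligible(n)), None)

    def force_election(self, collection: Optional[str] = None,
                       shard: Optional[str] = None) -> None:
        # (3) Assumed mapping: no arguments -> overseer election;
        # collection only -> every shard of it; both -> that one shard.
        if collection is None:
            self.overseer = self._pick(self.live)
            return
        for key, nodes in self.replicas.items():
            if key[0] == collection and shard in (None, key[1]):
                self.leaders[key] = self._pick(nodes)

    def status(self, node: str) -> dict:
        # (2) What a per-node status report might expose.
        return {"overseer": self.overseer == node,
                "leader_of": sorted(k for k, v in self.leaders.items() if v == node)}

    def shutdown(self, node: str) -> None:
        # (4) Step aside without touching persisted roles, hand off, then leave.
        self.avoiding.add(node)
        st = self.status(node)
        if st["overseer"]:
            self.force_election()
        for coll, shard in st["leader_of"]:
            self.force_election(coll, shard)
        self.live.discard(node)
        self.avoiding.discard(node)
```

For example, with nodes n1, n2, and n3 all hosting one shard and n2 holding the role, shutting down n1 while it is both overseer and shard leader hands both jobs to n3 and leaves the stored roles untouched.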

So...

* If you just want to reboot one node: you reboot that node, and instead of just acting like
it has dropped off the face of the earth and potentially triggering elections when the ZK ephemeral
nodes vanish, it proactively encourages an election first.
* If you want to shut down N machines permanently: you assign all of those N machines the
role "AVOID_RESPONSIBILITY" in advance, and then iterate over them shutting them down.  Ones
that had no responsibilities to begin with will shut down fast; nodes that did have responsibilities
will shut down more slowly as they force elections, but none of the other machines you are about
to shut down will take on those responsibilities.
* If you want to reboot N machines with minimal downtime: you can iterate over your N machines
checking their /admin/system response to see if they are the overseer or a shard leader --
if they are, you trigger the necessary action=forceelection commands and wait for them to
complete.  When you are done, you should be able to shut down/restart all N nodes very quickly,
and then remove the "AVOID_RESPONSIBILITY" role at your leisure.
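
The last scenario might look roughly like the operator-side sketch below. Every operation on `ClusterOps` is hypothetical (neither the role nor a "forceelection" action is described here as existing yet), and the code assumes the role is assigned to all N nodes up front, as in the second scenario; the text implies this by removing the role at the end but does not state it.

```python
from typing import Iterable, Optional, Protocol

ROLE = "AVOID_RESPONSIBILITY"  # role name proposed in this comment


class ClusterOps(Protocol):
    """Hypothetical operations an admin script would need; not a real Solr client."""

    def add_role(self, node: str, role: str) -> None: ...
    def remove_role(self, node: str, role: str) -> None: ...

    def status(self, node: str) -> dict:
        """Assumed shape: {"overseer": bool, "leader_of": [(collection, shard), ...]}."""
        ...

    def force_election(self, collection: Optional[str] = None,
                       shard: Optional[str] = None) -> None:
        """Assumed to block until the election has completed."""
        ...

    def restart(self, node: str) -> None: ...


def drain(ops: ClusterOps, node: str) -> int:
    """Move overseer/leader duties off `node`; return the number of elections forced."""
    st = ops.status(node)
    forced = 0
    if st["overseer"]:
        ops.force_election()             # no arguments: overseer election
        forced += 1
    for collection, shard in st["leader_of"]:
        ops.force_election(collection, shard)
        forced += 1
    return forced


def rolling_restart(ops: ClusterOps, nodes: Iterable[str]) -> None:
    nodes = list(nodes)
    for n in nodes:
        ops.add_role(n, ROLE)            # assumption: assigned before draining
    for n in nodes:
        drain(ops, n)                    # slow part: wait for elections
    for n in nodes:
        ops.restart(n)                   # fast: nothing left to hand off
    for n in nodes:
        ops.remove_role(n, ROLE)         # "at your leisure"
```

Assigning the role to every node before draining any of them is what keeps a drained responsibility from landing on another node that is about to be restarted.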


> SolrCloud: Add API to move leader off a Solr instance
> -----------------------------------------------------
>
>                 Key: SOLR-5991
>                 URL: https://issues.apache.org/jira/browse/SOLR-5991
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.7.1
>            Reporter: Rich Mayfield
>
> Common maintenance chores require restarting Solr instances.
> The process of a shutdown becomes a whole lot more reliable if we can proactively move
> any leadership roles off of the Solr instance we are going to shut down. The leadership election
> process then runs immediately.
> I am not sure what the semantics should be (either accomplishes the goal but one of these
> might be best):
> * A call to tell a core to give up leadership (thus the next replica is chosen)
> * A call to specify which core should become the leader



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

