lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-8914) inexplicable "no servers hosting shard: shard2" using MiniSolrCloudCluster
Date Tue, 29 Mar 2016 22:27:25 GMT

     [ https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Hoss Man updated SOLR-8914:
---------------------------
    Attachment: live_node_mentions_port56361_with_threadIds.log.txt


Having read more of {{ZkStateReader}}, and understanding a bit better when/why these various
log messages are written, it now seems certain to me that ZkStateReader's callbacks for watching
{{live_nodes}} and updating that information in the (local) {{ClusterState}} are *absolutely
not thread safe*.

Key things to note about the ZkStateReader code related to live nodes and the live-nodes log messages...

* The following are examples of log message formats that occur when a ZK {{Watcher}}'s {{process()}} method is triggered - in all of these cases the number of live nodes logged is the _current_ number, according to local state, prior to processing the Watcher event...
** LegacyClusterStateWatcher (old multi-collections state.json)
*** {{"A cluster state change: \[\{\}\] for collection \[\{\}\] has occurred - updating... (live nodes size: \[\{\}\])"}}
*** _NOTE: this is the current (local) {{liveNodes.size()}}, w/ no attempt to refresh {{liveNodes}} from ZK_
** StateWatcher (per collection state.json)
*** {{"A cluster state change: \[\{\}\] for collection \[\{\}\] has occurred - updating... (live nodes size: \[\{\}\])"}}
*** _NOTE: this is the current (local) {{liveNodes.size()}}, w/ no attempt to refresh {{liveNodes}} from ZK_
** LiveNodeWatcher
*** {{"A live node change: \[\{\}\], has occurred - updating... (live nodes size: \[\{\}\])"}}
*** *NOTE: this is the "old" value when the LiveNodeWatcher is triggered, prior to refreshing from ZK*
* The following is what gets logged _after_ the local cache of the live nodes data has been
updated (either by an explicit method call to fetch the data from ZK on startup, or because
a LiveNodeWatcher was triggered by ZK)
** {{"Updated live nodes from ZooKeeper... (\{\}) -> (\{\})"}}
* The full code for how the local cache of {{liveNodes}} is refreshed from ZK (either by an explicit method call on startup or because a LiveNodeWatcher was triggered) is...{code}
try {
  List<String> nodeList = zkClient.getChildren(LIVE_NODES_ZKNODE, watcher, true);
  newLiveNodes = new HashSet<>(nodeList);
} catch (KeeperException.NoNodeException e) {
  newLiveNodes = emptySet();
}
Set<String> oldLiveNodes;
synchronized (getUpdateLock()) {
  oldLiveNodes = this.liveNodes;
  this.liveNodes = newLiveNodes;
  if (clusterState != null) {
    clusterState.setLiveNodes(newLiveNodes);
  }
}
LOG.info("Updated live nodes from ZooKeeper... ({}) -> ({})", oldLiveNodes.size(), newLiveNodes.size());
{code}
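To make the race concrete: the {{zkClient.getChildren(...)}} call happens *outside* the {{getUpdateLock()}} block, so the swap under the lock can happen in a different order than the ZK reads, letting an older child list overwrite a newer one. Purely as an illustration (not a proposed patch - it only rearranges the methods already shown above, and it holds the update lock across a ZK round trip), serializing the read and the swap under the same lock would make each swap reflect an equal-or-newer child list...
{code}
// Illustrative sketch only, NOT the actual ZkStateReader code: because the ZK read
// and the local swap happen while holding the same update lock, a thread that swaps
// later must also have read later, so a stale child list can no longer "win".
synchronized (getUpdateLock()) {
  Set<String> newLiveNodes;
  try {
    newLiveNodes = new HashSet<>(zkClient.getChildren(LIVE_NODES_ZKNODE, watcher, true));
  } catch (KeeperException.NoNodeException e) {
    newLiveNodes = emptySet();
  }
  Set<String> oldLiveNodes = this.liveNodes;
  this.liveNodes = newLiveNodes;
  if (clusterState != null) {
    clusterState.setLiveNodes(newLiveNodes);
  }
  LOG.info("Updated live nodes from ZooKeeper... ({}) -> ({})", oldLiveNodes.size(), newLiveNodes.size());
}
{code}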


With all of that info in mind, if we revisit the log file and focus on the threadIds that log
each message for a single port we know is problematic (56361)...

{code}
grep -i "live nodes" jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt | grep 127.0.0.1:56361_solr
| perl -ple 's/.*2> (\d+).*?\((.*?-\d+)(?:-processing-n:127.*?)?\).*o.a.s.c.c.ZkStateReader(.*)/$1
$2 $3/' > live_node_mentions_port56361_with_threadIds.log.txt
{code}

...it becomes really easy to spot the problem...

{noformat}
1290004 jetty-launcher-1118-thread-5  Updated live nodes from ZooKeeper... (0) -> (0)
1290033 zkCallback-1135-thread-1  A live node change: [WatchedEvent state:SyncConnected type:NodeChildrenChanged
path:/live_nodes], has occurred - updating... (live nodes size: [0])
1290047 zkCallback-1135-thread-2  A live node change: [WatchedEvent state:SyncConnected type:NodeChildrenChanged
path:/live_nodes], has occurred - updating... (live nodes size: [0])
1290052 zkCallback-1135-thread-3  A live node change: [WatchedEvent state:SyncConnected type:NodeChildrenChanged
path:/live_nodes], has occurred - updating... (live nodes size: [0])
1290070 zkCallback-1135-thread-3  Updated live nodes from ZooKeeper... (0) -> (4)
1290071 zkCallback-1135-thread-3  A live node change: [WatchedEvent state:SyncConnected type:NodeChildrenChanged
path:/live_nodes], has occurred - updating... (live nodes size: [4])
1290071 zkCallback-1135-thread-3  Updated live nodes from ZooKeeper... (4) -> (5)
1290075 zkCallback-1135-thread-1  Updated live nodes from ZooKeeper... (5) -> (1)
1290076 zkCallback-1135-thread-2  Updated live nodes from ZooKeeper... (1) -> (3)
{noformat}

* timestamp #1290004 is when jetty is first starting up, and forcibly asks ZK for the list
of live nodes.
* we then see 3 threads being spun up in numeric sequence order, to (start) processing LiveNodeWatcher
events
* thread-3 finishes and updates the live nodes to a list of size "4"
* thread-3 is then re-used to (start) processing another LiveNodeWatcher event
* thread-3 finishes again and updates the live nodes to a list of size "5"
* thread-1 now finally finishes, and overwrites the live node list with a list of size "1" *even though its data is almost certainly old*
* thread-2 now finally finishes, and also overwrites the live node list with a list of size "3" *even though this data is almost certainly also old*

Filling in the blanks, a valid sequence of events that fits the code and matches the logs
could have been (colors match the T# thread ids)...

* {color:blue}T1 {{process()}} called, logs "A live node change: ..."{color}
* {color:blue}T1 {{zkClient.getChildren(...)}} returns a list of 1 live node{color}
* {color:red}T2 {{process()}} called, logs "A live node change: ..."{color}
* {color:green}T3 {{process()}} called, logs "A live node change: ..."{color}
* {color:red}T2 {{zkClient.getChildren(...)}} returns a list of 3 live nodes{color}
* {color:green}T3 {{zkClient.getChildren(...)}} returns a list of 4 live nodes{color}
* {color:green}T3 {{clusterState.setLiveNodes("4")}} called, logs "Updated live nodes..."{color}
* {color:green}T3 {{process()}} is now finished, thread is free for re-use{color}
* {color:green}T3 {{process()}} called, logs "A live node change: ..."{color}
* {color:green}T3 {{zkClient.getChildren(...)}} returns a list of 5 live nodes{color}
* {color:green}T3 {{clusterState.setLiveNodes("5")}} called, logs "Updated live nodes..."{color}
* {color:green}T3 {{process()}} is now finished, thread is free for re-use{color}
* {color:blue}T1 {{clusterState.setLiveNodes("1")}} called, logs "Updated live nodes..."{color}
* {color:blue}T1 {{process()}} is now finished, thread is free for re-use{color}
* {color:red}T2 {{clusterState.setLiveNodes("3")}} called, logs "Updated live nodes..."{color}
* {color:red}T2 {{process()}} is now finished, thread is free for re-use{color}

_(NOTE: one thing I'm not clear on is why there are only 4 {{live_nodes}} NodeChildrenChanged
WatchedEvents triggered - I would expect 5 total, unless ZK doesn't guarantee a unique event
for each new child node; does it send a single event if 2 children are added very close together
in time?)_
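To show just that interleaving pattern in isolation, here is a tiny standalone sketch (*not* Solr code - the class name, node counts, and delays below are made up for illustration): each task "observes" a live-node count, then publishes it under a lock after an artificial delay, so the publish order rather than the observation order decides the final value - the same last-writer-wins overwrite seen in the log above.
{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Solr.
public class StaleOverwriteDemo {
  private static final Object UPDATE_LOCK = new Object();
  private static int liveNodeCount = 0;

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(3); // like zkCallback-1135-thread-{1,2,3}
    // { live-node count a task got back from its "ZK read", delay before it publishes }
    int[][] observations = { {4, 0}, {1, 300}, {3, 400} };
    CountDownLatch done = new CountDownLatch(observations.length);
    for (int[] obs : observations) {
      pool.execute(() -> {
        int observed = obs[0];        // the "read" happens outside the lock...
        sleepQuietly(obs[1]);         // ...then the thread gets delayed/preempted...
        synchronized (UPDATE_LOCK) {  // ...and the publish happens under the lock
          System.out.printf("Updated live nodes... (%d) -> (%d)%n", liveNodeCount, observed);
          liveNodeCount = observed;
        }
        done.countDown();
      });
    }
    done.await(5, TimeUnit.SECONDS);
    pool.shutdown();
    // Prints 3: the value 4 is published first and then overwritten by the slower
    // tasks carrying 1 and 3 - the same kind of stale overwrite seen on port 56361.
    System.out.println("final liveNodeCount = " + liveNodeCount);
  }

  private static void sleepQuietly(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }
}
{code}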

After the sequence of events above, port #56361 is hosed. In a real-world situation, it will
never update its local state to match the _real_ list of live nodes in ZK unless some other
node goes up/down, triggering the LiveNodeWatcher again.

The rest of the "live node" log messages in our log file are either passively reporting the
locally cached data during other ZK events (specifically: collection creation triggering the state.json
watcher), or they are the triggers to the LiveNodeWatcher when MiniSolrCloudCluster starts
shutting down nodes after the test has failed.







> inexplicable "no servers hosting shard: shard2" using MiniSolrCloudCluster
> --------------------------------------------------------------------------
>
>                 Key: SOLR-8914
>                 URL: https://issues.apache.org/jira/browse/SOLR-8914
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>         Attachments: jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt, live_node_mentions_port56361_with_threadIds.log.txt,
live_nodes_mentions.log.txt
>
>
> Jenkins encountered a failure in TestTolerantUpdateProcessorCloud over the weekend....
> {noformat}
> http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Solaris/32/consoleText
> Checking out Revision c46d7686643e7503304cb35dfe546bce9c6684e7 (refs/remotes/origin/branch_6x)
> Using Java: 64bit/jdk1.8.0 -XX:+UseCompressedOops -XX:+UseG1GC
> {noformat}
> The failure happened during the static setup of the test, when a MiniSolrCloudCluster
& several clients are initialized -- before any code related to TolerantUpdateProcessor
is ever used.
> I can't reproduce this, or really make sense of what I'm (not) seeing here in the logs,
so I'm filing this jira with my analysis in the hopes that someone else can help make sense
of it.


