lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hoss Man (JIRA)" <>
Subject [jira] [Commented] (SOLR-8914) inexplicable "no servers hosting shard: shard2" using MiniSolrCloudCluster
Date Tue, 29 Mar 2016 01:53:25 GMT


Hoss Man commented on SOLR-8914:

Nice catch Scott!

FWIW, here's a one liner that pulls out all the "live nodes size" messages from the log and
distills it down to just the timeestamp, threadIds (showing the port# of the jetty instance)
and the current live node count...

grep "live nodes size" jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt | perl -ple
's/.*2> (\d+).*?\((.*?)\).*?o.a.s.c.c.ZkStateReader.*?live nodes size: (\[\d+\]).*/$1 $2

using the output from that command as a refrence point, we can revist some of my earlier observations...

* NOTE: there are 5 nodes in this cluster, but only 2 shards x 2 replicas
* L122: collection creation
** when the 4 cores for the collection are created, the jetty node that does _NOT_ recieve
any core is port #63099...{noformat}
1291030 INFO  (OverseerThreadFactory-3272-thread-1-processing-n: [n:
   ] o.a.s.c.OverseerCollectionMessageHandler Creating core test_col_shard1_replica1 as part
of shard shard1 of collection test_col on
1291031 INFO  (OverseerThreadFactory-3272-thread-1-processing-n: [n:
   ] o.a.s.c.OverseerCollectionMessageHandler Creating core test_col_shard2_replica2 as part
of shard shard2 of collection test_col on
1291038 INFO  (OverseerThreadFactory-3272-thread-1-processing-n: [n:
   ] o.a.s.c.OverseerCollectionMessageHandler Creating core test_col_shard2_replica1 as part
of shard shard2 of collection test_col on
1291041 INFO  (OverseerThreadFactory-3272-thread-1-processing-n: [n:
   ] o.a.s.c.OverseerCollectionMessageHandler Creating core test_col_shard1_replica2 as part
of shard shard1 of collection test_col on
* L129: call to AbstractDistribZkTestBase.waitForRecoveriesToFinish 
** this happens between timestamp #1302667 - #1302668
** the most recent flurry of log messages mentioning "live nodes size" prior to this timestamp
window are...{noformat}
1301787 zkCallback-1136-thread-1-processing-n: [5]
1301788 zkCallback-1142-thread-2-processing-n: [5]
1301791 zkCallback-1143-thread-1-processing-n: [5]
1301791 zkCallback-1135-thread-2-processing-n: [3]
** NOTE: these log messages are watches on {{path:/collections/test_col/state.json}} which
port#63099 has no reason to watch (since it's not hosting any shards of {{test_col}} so it's
not unusual that it's not logging anything around these times
** {color:red}why does port #56361 think at this point that there are only 3 live nodes?{color}
* L184 - L188: adding docs
** this is done with CloudSolrClient, so each doc goes to it's respective leader regardless
of wether any 1 node has a broken view of live_nodes
** port #56361 hosts test_col_shard1_replica1 which is aparently the leader of shard1 at this
** the shard1 doc is forwarded correctly to test_col_shard1_replica2 (on port #34275)
*** so #34275 is evidently in port #56361's "view" of live_nodes, and _not_ one of the "missing"
* L189: commit
** happens during timestamps #1302724 - #1302733
** (no new "live nodes size" log messages prior to this time window since the ones mentioned
** IIUC CloudSolrClient correctly, this is considered a "non routable" update, so it will
be sent to a (leader) node picked at random
** from the logs, it looks like CloudSolrClient picked port #56361
** port #56361's test_col_shard1_replica1 forwarded to it's replica test_col_shard1_replica2
on port #34275
** as mentioned before shard2 never got the update - aparently because port #56361 didn't
think that shard2's leader was in live_nodes
* L190: the {{\*:\*} query
** timestamp #1302737 is when port #56361 (who we know has a bogus view of live_nodes) logs
that no servesr are hosting shard2, and returns a 503 error code
** (there are still no new "live nodes size" log messages prior to this error since the ones
mentioned previously)
** Something that didn't occur to me before, is that IIRC CloudSolrServertimestamp will retry
some error codes -- i'm guessing 503 is one of them, because that would explain the subsequent
log messages 5ms later...
** timestamps #1302742 - 1302746, it appears that port #34275 was queried by CloudSolrClient
on retry, and it did know that both replicas of shard2 were alive and queried them (but since
they never got the commit, they had numFound=0){noformat}
1302742 INFO  (qtp316491708-7357) [n: c:test_col s:shard2 r:core_node2
x:test_col_shard2_replica2] o.a.s.c.S.Request [test_col_shard2_replica2]  webapp=/solr path=/select
hits=0 status=0 QTime=0
1302742 INFO  (qtp1409631559-7346) [n: c:test_col s:shard1 r:core_node1
x:test_col_shard1_replica1] o.a.s.c.S.Request [test_col_shard1_replica1]  webapp=/solr path=/select
hits=1 status=0 QTime=0
1302745 INFO  (qtp1409631559-7344) [n: c:test_col s:shard1 r:core_node1
x:test_col_shard1_replica1] o.a.s.c.S.Request [test_col_shard1_replica1]  webapp=/solr path=/select
status=0 QTime=0
1302746 INFO  (qtp1904355206-7332) [n: c:test_col s:shard1 r:core_node4
x:test_col_shard1_replica2] o.a.s.c.S.Request [test_col_shard1_replica2]  webapp=/solr path=/select
hits=1 status=0 QTime=7

So to sum up: it appears everything i noted before would be easily explained by a some broken
code regarding live nodes that only affected port #56361 while all other nodes had a perfecly
accurate view of the world.

> inexplicable "no servers hosting shard: shard2" using MiniSolrCloudCluster
> --------------------------------------------------------------------------
>                 Key: SOLR-8914
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>         Attachments: jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt
> Jenkin's encountered a failure in TestTolerantUpdateProcessorCloud over the weekend....
> {noformat}
> Checking out Revision c46d7686643e7503304cb35dfe546bce9c6684e7 (refs/remotes/origin/branch_6x)
> Using Java: 64bit/jdk1.8.0 -XX:+UseCompressedOops -XX:+UseG1GC
> {noformat}
> The failure happened during the static setup of the test, when a MiniSolrCloudCluster
& several clients are initialized -- before any code related to TolerantUpdateProcessor
is ever used.
> I can't reproduce this, or really make sense of what i'm (not) seeing here in the logs,
so i'm filing this jira with my analysis in the hopes that someone else can help make sense
of it.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message