lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-8914) inexplicable "no servers hosting shard: shard2" using MiniSolrCloudCluster
Date Mon, 28 Mar 2016 18:16:25 GMT

     [ https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Hoss Man updated SOLR-8914:
---------------------------
    Attachment: jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt


The specific assertion that failed is a sanity-checking query for 2 docs that are added to ensure
that some compositeId routing prefixes used throughout the tests actually correspond to the
expected shard names (a rough SolrJ sketch of this sequence follows the list below) -- but for the
purpose of discussing this particular jenkins failure, the key things to know about this test failure are...

* there are 2 shards w/ replicationFactor=2 which are initialized successfully & not undergoing recovery
* 2 docs are (successfully) added, with routing prefixes such that one lands on each shard
* a hard commit is (successfully) executed from the perspective of the CloudSolrClient
** {color:red}but only shard1 logs the commit & newSearcher{color}
* a {{\*:\*}} query is executed...
** only 1 doc is found by this query
** an error is logged by a shard1 node that {color:red}no servers hosting shard: shard2{color}
** {color:red}a shard2 node does in fact log that it was consulted as part of this query{color}
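
For reference, the sequence above boils down to something like the following -- a minimal SolrJ sketch, NOT the test's actual code; the doc ids and the {{expected_shard_s}} / {{\[shard\]}} field names are taken from the logged requests, everything else here is illustrative...{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class RoutingSanityCheckSketch {
  // sketch only (not the test's code); client is assumed to point at the
  // MiniSolrCloudCluster with default collection "test_col"
  static void check(CloudSolrClient client) throws Exception {
    SolrInputDocument doc1 = new SolrInputDocument();
    doc1.addField("id", "abc!447049285");          // logs show this routed to shard1
    doc1.addField("expected_shard_s", "shard1");
    SolrInputDocument doc2 = new SolrInputDocument();
    doc2.addField("id", "XYZ!701047101");          // logs show this routed to shard2
    doc2.addField("expected_shard_s", "shard2");
    client.add(doc1);
    client.add(doc2);
    client.commit();   // hard commit -- the step only shard1 appears to act on

    // distributed *:* query; [shard] asks each returned doc to report where it was found
    SolrQuery q = new SolrQuery("*:*");
    q.setFields("id", "expected_shard_s", "[shard]");
    QueryResponse rsp = client.query(q);
    if (rsp.getResults().getNumFound() != 2) {     // in effect, the assertion that failed
      throw new AssertionError("expected 2 docs, found " + rsp.getResults().getNumFound());
    }
    for (SolrDocument doc : rsp.getResults()) {
      System.out.println(doc.getFieldValue("id") + " found on " + doc.getFieldValue("[shard]"));
    }
  }
}
{code}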


This raises 2 key questions I can't make sense of...

* Why didn't either shard2 node ever log (and execute) the commit & opening of a new searcher?
* Why did a shard1 node log an error that "no servers hosting shard: shard2" even though
shard2's core_node2 clearly received the first phase of the distributed request & responded
with a success?


I've attached the full log file, and provided my detailed notes below, line by line (of both the
relevant code & the log file). NOTE: all "Lxx" line numbers refer to git SHA c46d768, since
that's what jenkins was testing.

----

* TestTolerantUpdateProcessorCloud.createMiniSolrCloudCluster...
** begins by creating a 5 node MiniSolrCloudCluster & initializing a collection with 2 shards
and replicationFactor=2
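** for illustration only, that setup amounts to roughly the following (a hedged sketch, not the test's actual setup code; the constructor & helper signatures here are approximations from memory, and the configset upload is elided)...{code:java}
import java.nio.file.Files;
import org.apache.solr.client.solrj.embedded.JettyConfig;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.cloud.MiniSolrCloudCluster;

class ClusterSetupSketch {
  /** rough sketch: 5 jetty nodes, one collection of 2 shards x 2 replicas
   *  (signatures approximate; uploading the "conf" configset to ZK is elided) */
  static CloudSolrClient setup() throws Exception {
    JettyConfig jettyConfig = JettyConfig.builder().setContext("/solr").build();
    MiniSolrCloudCluster cluster =
        new MiniSolrCloudCluster(5, Files.createTempDirectory("minicluster"), jettyConfig);
    // assumes a configset named "conf" has already been uploaded to the cluster's ZK
    CollectionAdminRequest.createCollection("test_col", "conf", 2, 2)
        .process(cluster.getSolrClient());
    CloudSolrClient cloudClient = cluster.getSolrClient();
    cloudClient.setDefaultCollection("test_col");
    return cloudClient;
  }
}
{code}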
* timestamp #1302667 - #1302668
** TestTolerantUpdateProcessorCloud.createMiniSolrCloudCluster @ L129
** call to AbstractDistribZkTestBase.waitForRecoveriesToFinish
** can see from the log that all 4 replicas are active & live, with nothing recovering...{noformat}
1302667 INFO  (SUITE-TestTolerantUpdateProcessorCloud-seed#[F0FBD4E7705C29B5]-worker) [  
 ] o.a.s.c.AbstractDistribZkTestBase Wait for recoveries to finish - collection: test_col
failOnTimeout:true timeout (sec):330
-
replica:core_node1 rstate:active live:true
replica:core_node4 rstate:active live:true
replica:core_node2 rstate:active live:true
replica:core_node3 rstate:active live:true
no one is recoverying
1302668 INFO  (SUITE-TestTolerantUpdateProcessorCloud-seed#[F0FBD4E7705C29B5]-worker) [  
 ] o.a.s.c.AbstractDistribZkTestBase Recoveries finished - collection: test_col
{noformat}
* TestTolerantUpdateProcessorCloud.createMiniSolrCloudCluster @ L132 - L183
** the test is inspecting ZK state to init some HttpSolrClients pointed at the URLs for specific
leaders & replicas (these are used to test all possible forwarding situations in the various
test methods)
** timestamp #1302671 shows the ZkStateReader refreshing its live nodes info for this purpose
(L141)
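** for illustration, resolving those per-replica URLs from the cluster state looks roughly like this (a sketch, not the test's code; {{ZkCoreNodeProps}} is just one convenient way to get a core URL)...{code:java}
import java.util.HashMap;
import java.util.Map;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkCoreNodeProps;

class PerReplicaClientsSketch {
  /** sketch only: one HttpSolrClient per core, labeled leader vs. passive per shard */
  static Map<String, HttpSolrClient> build(CloudSolrClient cloudClient, String collection) {
    Map<String, HttpSolrClient> clients = new HashMap<>();
    DocCollection docCol =
        cloudClient.getZkStateReader().getClusterState().getCollection(collection);
    for (Slice slice : docCol.getSlices()) {
      Replica leader = slice.getLeader();
      for (Replica replica : slice.getReplicas()) {
        String coreUrl = new ZkCoreNodeProps(replica).getCoreUrl();
        boolean isLeader = leader != null && leader.getName().equals(replica.getName());
        clients.put((isLeader ? "leader_" : "passive_") + slice.getName() + "_" + replica.getName(),
                    new HttpSolrClient(coreUrl));
      }
    }
    return clients;
  }
}
{code}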
* TestTolerantUpdateProcessorCloud.createMiniSolrCloudCluster @ L184 - L188
** adding 2 docs with specific routing prefixes
** ultimate goal is to sanity check that the routing prefixes map to the expected shards, so
other parts of the test can assume this
** timestamps #1302696 - #1302719 record these 2 docs being added to both replicas of shard1
& shard2...{noformat}
1302696 INFO  (qtp1904355206-7333) [n:127.0.0.1:34275_solr c:test_col s:shard1 r:core_node4
x:test_col_shard1_replica2] o.a.s.u.p.LogUpdateProcessorFactory [test_col_shard1_replica2]
 webapp=/solr path=/update params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:56361/solr/test_col_shard1_replica1/&wt=javabin&version=2}{add=[abc!447049285
(1529857595576156160)]} 0 4
1302697 INFO  (qtp1409631559-7341) [n:127.0.0.1:56361_solr c:test_col s:shard1 r:core_node1
x:test_col_shard1_replica1] o.a.s.u.p.LogUpdateProcessorFactory [test_col_shard1_replica1]
 webapp=/solr path=/update params={wt=javabin&version=2}{add=[abc!447049285 (1529857595576156160)]}
0 14
1302719 INFO  (qtp797763926-7371) [n:127.0.0.1:51144_solr c:test_col s:shard2 r:core_node3
x:test_col_shard2_replica1] o.a.s.u.p.LogUpdateProcessorFactory [test_col_shard2_replica1]
 webapp=/solr path=/update params={update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:33616/solr/test_col_shard2_replica2/&wt=javabin&version=2}{add=[XYZ!701047101
(1529857595596079104)]} 0 7
1302719 INFO  (qtp316491708-7355) [n:127.0.0.1:33616_solr c:test_col s:shard2 r:core_node2
x:test_col_shard2_replica2] o.a.s.u.p.LogUpdateProcessorFactory [test_col_shard2_replica2]
 webapp=/solr path=/update params={wt=javabin&version=2}{add=[XYZ!701047101 (1529857595596079104)]}
0 18
{noformat}
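** (aside) which shard a given routing prefix hashes to can also be computed directly from the collection's router; roughly (a sketch -- I believe {{DocRouter.getTargetSlice}} has this shape in 6.x, but treat the exact signature as an assumption)...{code:java}
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.DocRouter;
import org.apache.solr.common.cloud.Slice;

class RoutePrefixSketch {
  /** sketch: ask the collection's (compositeId) router which shard an id would land on */
  static String shardFor(String id, DocCollection docCol) {
    DocRouter router = docCol.getRouter();
    // route purely on the id, e.g. "abc!447049285" -> "shard1", "XYZ!701047101" -> "shard2"
    Slice slice = router.getTargetSlice(id, null /* doc */, null /* route */, null /* params */, docCol);
    return slice.getName();
  }
}
{code}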
* TestTolerantUpdateProcessorCloud.createMiniSolrCloudCluster @ L189
** do a commit using the CloudSolrClient
** timestamps #1302724 - #1302733 show both replicas of shard1 acting on this commit &
opening new searchers...{noformat}
1302724 INFO  (qtp1409631559-7343) [n:127.0.0.1:56361_solr c:test_col s:shard1 r:core_node1
x:test_col_shard1_replica1] o.a.s.u.DirectUpdateHandler2 start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
1302724 INFO  (qtp1904355206-7330) [n:127.0.0.1:34275_solr c:test_col s:shard1 r:core_node4
x:test_col_shard1_replica2] o.a.s.u.DirectUpdateHandler2 start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
1302727 INFO  (qtp1904355206-7330) [n:127.0.0.1:34275_solr c:test_col s:shard1 r:core_node4
x:test_col_shard1_replica2] o.a.s.c.SolrDeletionPolicy SolrDeletionPolicy.onCommit: commits:
num=2
	commit{dir=MockDirectoryWrapper(RAMDirectory@21c6a821 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@26734241),segFN=segments_1,generation=1}
	commit{dir=MockDirectoryWrapper(RAMDirectory@21c6a821 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@26734241),segFN=segments_2,generation=2}
1302727 INFO  (qtp1409631559-7343) [n:127.0.0.1:56361_solr c:test_col s:shard1 r:core_node1
x:test_col_shard1_replica1] o.a.s.c.SolrDeletionPolicy SolrDeletionPolicy.onCommit: commits:
num=2
	commit{dir=MockDirectoryWrapper(RAMDirectory@5616141f lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@6180bc4c),segFN=segments_1,generation=1}
	commit{dir=MockDirectoryWrapper(RAMDirectory@5616141f lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@6180bc4c),segFN=segments_2,generation=2}
1302728 INFO  (qtp1904355206-7330) [n:127.0.0.1:34275_solr c:test_col s:shard1 r:core_node4
x:test_col_shard1_replica2] o.a.s.c.SolrDeletionPolicy newest commit generation = 2
1302729 INFO  (qtp1409631559-7343) [n:127.0.0.1:56361_solr c:test_col s:shard1 r:core_node1
x:test_col_shard1_replica1] o.a.s.c.SolrDeletionPolicy newest commit generation = 2
1302730 INFO  (qtp1409631559-7343) [n:127.0.0.1:56361_solr c:test_col s:shard1 r:core_node1
x:test_col_shard1_replica1] o.a.s.s.SolrIndexSearcher Opening [Searcher@f331408[test_col_shard1_replica1]
main]
1302730 INFO  (qtp1904355206-7330) [n:127.0.0.1:34275_solr c:test_col s:shard1 r:core_node4
x:test_col_shard1_replica2] o.a.s.s.SolrIndexSearcher Opening [Searcher@5d1967dd[test_col_shard1_replica2]
main]
1302730 INFO  (qtp1409631559-7343) [n:127.0.0.1:56361_solr c:test_col s:shard1 r:core_node1
x:test_col_shard1_replica1] o.a.s.u.DirectUpdateHandler2 end_commit_flush
1302730 INFO  (qtp1904355206-7330) [n:127.0.0.1:34275_solr c:test_col s:shard1 r:core_node4
x:test_col_shard1_replica2] o.a.s.u.DirectUpdateHandler2 end_commit_flush
1302731 INFO  (searcherExecutor-3283-thread-1-processing-n:127.0.0.1:56361_solr x:test_col_shard1_replica1
s:shard1 c:test_col r:core_node1) [n:127.0.0.1:56361_solr c:test_col s:shard1 r:core_node1
x:test_col_shard1_replica1] o.a.s.c.SolrCore [test_col_shard1_replica1] Registered new searcher
Searcher@f331408[test_col_shard1_replica1] main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(6.1.0):c1)))}
1302732 INFO  (qtp1409631559-7343) [n:127.0.0.1:56361_solr c:test_col s:shard1 r:core_node1
x:test_col_shard1_replica1] o.a.s.u.p.LogUpdateProcessorFactory [test_col_shard1_replica1]
 webapp=/solr path=/update params={update.distrib=FROMLEADER&waitSearcher=true&openSearcher=true&commit=true&softCommit=false&distrib.from=http://127.0.0.1:56361/solr/test_col_shard1_replica1/&commit_end_point=true&wt=javabin&version=2&expungeDeletes=false}{commit=}
0 7
1302732 INFO  (searcherExecutor-3284-thread-1-processing-n:127.0.0.1:34275_solr x:test_col_shard1_replica2
s:shard1 c:test_col r:core_node4) [n:127.0.0.1:34275_solr c:test_col s:shard1 r:core_node4
x:test_col_shard1_replica2] o.a.s.c.SolrCore [test_col_shard1_replica2] Registered new searcher
Searcher@5d1967dd[test_col_shard1_replica2] main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(6.1.0):c1)))}
1302732 INFO  (qtp1904355206-7330) [n:127.0.0.1:34275_solr c:test_col s:shard1 r:core_node4
x:test_col_shard1_replica2] o.a.s.u.p.LogUpdateProcessorFactory [test_col_shard1_replica2]
 webapp=/solr path=/update params={update.distrib=FROMLEADER&waitSearcher=true&openSearcher=true&commit=true&softCommit=false&distrib.from=http://127.0.0.1:56361/solr/test_col_shard1_replica1/&commit_end_point=true&wt=javabin&version=2&expungeDeletes=false}{commit=}
0 8
1302733 INFO  (qtp1409631559-7345) [n:127.0.0.1:56361_solr c:test_col s:shard1 r:core_node1
x:test_col_shard1_replica1] o.a.s.u.p.LogUpdateProcessorFactory [test_col_shard1_replica1]
 webapp=/solr path=/update params={_stateVer_=test_col:7&waitSearcher=true&commit=true&softCommit=false&wt=javabin&version=2}{commit=}
0 11
{noformat}
** {color:red}What the hell happened to shard2?! Why isn't shard2 logging anything here?!{color}
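** (aside) if this were reproducible, one way to narrow it down would be to fire an explicit commit straight at each shard2 core URL and watch whether those cores ever log {{start commit}} / {{Registered new searcher}} -- roughly (a sketch using stock SolrJ calls, not anything the test currently does)...{code:java}
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.UpdateRequest;

class DirectCommitSketch {
  /** sketch only: send an explicit commit straight to one core's /update handler */
  static void commitDirectly(String coreUrl) throws Exception {
    try (HttpSolrClient core = new HttpSolrClient(coreUrl)) {
      UpdateRequest req = new UpdateRequest();
      req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true /* waitFlush */, true /* waitSearcher */);
      req.process(core);
      // then grep that core's log for "start commit" / "Registered new searcher"
    }
  }
}
{code}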
* TestTolerantUpdateProcessorCloud.createMiniSolrCloudCluster @ L190
** do a {{\*:\*}} query using the CloudSolrClient
** goal is to assert that the {{\[shard\]}} value for each of our docs (by prefix) matches the
expectation the rest of the test will have
** timestamp #1302737 shows an error log from a shard1 node indicating that both replicas
of shard2 have just vanished completely...{noformat}
1302737 ERROR (qtp1409631559-7347) [n:127.0.0.1:56361_solr c:test_col s:shard1 r:core_node1
x:test_col_shard1_replica1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException:
no servers hosting shard: shard2
	at org.apache.solr.handler.component.HttpShardHandler.prepDistributed(HttpShardHandler.java:451)
	at org.apache.solr.handler.component.SearchHandler.getAndPrepShardHandler(SearchHandler.java:224)
	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:262)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:155)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2033)
	at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:652)
	at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:229)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:184)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
	at org.apache.solr.client.solrj.embedded.JettySolrRunner$DebugFilter.doFilter(JettySolrRunner.java:109)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:224)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:462)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
	at org.eclipse.jetty.server.Server.handle(Server.java:518)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
	at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
	at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)
	at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
	at java.lang.Thread.run(Thread.java:745)

1302737 INFO  (qtp1409631559-7347) [n:127.0.0.1:56361_solr c:test_col s:shard1 r:core_node1
x:test_col_shard1_replica1] o.a.s.c.S.Request [test_col_shard1_replica1]  webapp=/solr path=/select
params={q=*:*&_stateVer_=test_col:7&fl=id,expected_shard_s,[shard]&wt=javabin&version=2}
status=503 QTime=0
{noformat}
** but 5ms later, timestamp #1302742 shows shard2's core_node2 logging its successful response
to the initial phase of the distributed query (although with {{hits=0}}, since it never processed
the commit or opened a new searcher after the documents were added) ...{noformat}
1302742 INFO  (qtp316491708-7357) [n:127.0.0.1:33616_solr c:test_col s:shard2 r:core_node2
x:test_col_shard2_replica2] o.a.s.c.S.Request [test_col_shard2_replica2]  webapp=/solr path=/select
params={distrib=false&_stateVer_=test_col:7&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=http://127.0.0.1:33616/solr/test_col_shard2_replica2/|http://127.0.0.1:51144/solr/test_col_shard2_replica1/&rows=10&version=2&q=*:*&NOW=1458985896717&isShard=true&wt=javabin}
hits=0 status=0 QTime=0
{noformat}
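** (aside) {{HttpShardHandler.prepDistributed}} only throws "no servers hosting shard" when, in the ClusterState snapshot it consults, a slice has no replica that is both active and hosted on a live node -- i.e. roughly the following check returning false for shard2 (my approximation of that logic, not the actual prepDistributed code)...{code:java}
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

class ShardLivenessSketch {
  /** sketch of (roughly) what prepDistributed requires: at least one active replica on a live node */
  static boolean hasUsableReplica(ClusterState clusterState, String collection, String shard) {
    Slice slice = clusterState.getCollection(collection).getSlice(shard);
    if (slice == null) return false;
    for (Replica replica : slice.getReplicas()) {
      boolean live = clusterState.getLiveNodes().contains(replica.getNodeName());
      boolean active = replica.getState() == Replica.State.ACTIVE;
      if (live && active) {
        return true;  // this replica could serve the shard's part of the query
      }
    }
    return false;  // -> "no servers hosting shard: <shard>" (unless shards.tolerant is set)
  }
}
{code}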
* The test ultimately fails because the {{\*:\*}} sanity check of numDocs=2 fails
** TestTolerantUpdateProcessorCloud.createMiniSolrCloudCluster @ L192
* Other than the {{hits=0}} logging mentioned above (#1302742), there is no logging from either
of the shard2 nodes between timestamp #1302719, when both replicas accept the doc add, and
#1302758, when the test starts shutting down nodes and closing cores.



> inexplicable "no servers hosting shard: shard2" using MiniSolrCloudCluster
> --------------------------------------------------------------------------
>
>                 Key: SOLR-8914
>                 URL: https://issues.apache.org/jira/browse/SOLR-8914
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>         Attachments: jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt
>
>
> Jenkins encountered a failure in TestTolerantUpdateProcessorCloud over the weekend....
> {noformat}
> http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Solaris/32/consoleText
> Checking out Revision c46d7686643e7503304cb35dfe546bce9c6684e7 (refs/remotes/origin/branch_6x)
> Using Java: 64bit/jdk1.8.0 -XX:+UseCompressedOops -XX:+UseG1GC
> {noformat}
> The failure happened during the static setup of the test, when a MiniSolrCloudCluster
& several clients are initialized -- before any code related to TolerantUpdateProcessor
is ever used.
> I can't reproduce this, or really make sense of what I'm (not) seeing here in the logs,
so I'm filing this jira with my analysis in the hopes that someone else can help make sense
of it.



