lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-11035) (at least) 2 distinct failures possible when clients attempt searches during SolrCore reload
Date Sat, 08 Jul 2017 01:44:00 GMT

     [ https://issues.apache.org/jira/browse/SOLR-11035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Hoss Man updated SOLR-11035:
----------------------------
    Attachment: SOLR-11035.patch
                log.txt


This issue is the root cause of some suspicious (mostly cloud based) test failures that --
on the surface -- may initially seem like they are either 'impossible' or that the commit
requests done by the tests are returning before the newSearcher was opened (even though {{waitSearcher=true}}).

In my personal experience, the crux of the symptoms are:
* Documents missing from searches done immediately after they were (successfully) added+commited
* Seeds don't reproduce reliably, but fail more commonly when running many tests and/or CPU
usage on the test machine is high.

An example of this class of test failure that initially caught my eye, and I could typically
get to reproduce semi-frequently is DocValuesNotIndexedTest.testGroupingDVOnly.  Running only
that test method -- or even that entire class -- never failed for me, but if i ran {{ant test
-Dtests.class=\*DocValues\*}} I could usually get a consistent failure within ~15 attempts.

One thing most of these failing tests have in common is that instead of a static schema, they
use managed-schema and either use the Schema API to explicitly create the fields needed, or
let the schemaless update processors add fields as needed.

In comparing the logs of "successful" runs vs "failure" runs, the thing that jumped out at
me is that the failure runs contain more SolrCore "reloads" (due to SOLR-11034) and in every
failure case i saw the "CLOSING" log message of a SolrCore was occuring _after_ the document
adds & commit had been logged by that core, but _before_ the log messages from the searches
that failed validation (meaning they were processed by the newly (re)loaded core)

As Shalin notes in SOLR-10562 (where the Congif API is used causing a core reload) in the
process of closing the "old" core and opening the "new" core, there is a race condition where
the "old" core will accept/log/write documents/commits to the index, but these will not be
immediately visible to the SolrIndexSearcher that the "new" core is using once the hand-off
is made.

I wrote the attached test class to try and demonstrate this problem in it's simplest form
(independent of ManagedIndexSchema or SolrCloud or Config API) using concurrent 2 threads:
one to do an "add, commit(waitSearcher), query" and the second to trigger a core reload.

In addition to reproducing the failure I expected (where the query result don't reflect the
last add+commit) i was able to trigger the second type of failure mentioned above: queries
can be routed to the "old" core after/during it's shutdown, causing a solr exception because
the (old) core refuses to open a new searcher.

I tweaked my test as originally written to include a HACK making it possible to demonstrate
both types of failures in a single run -- see the attached log for full details....

{noformat}
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestReloadConcurrentAdds -Dtests.method=testConcurrent
-Dtests.seed=5A728C1E8654399 -Dtests.slow=true -Dtests.locale=es-PA -Dtests.timezone=Atlantic/Faroe
-Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.57s | TestReloadConcurrentAdds.testConcurrent <<<
   [junit4]    > Throwable #1: java.lang.RuntimeException: Exception during query
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([5A728C1E8654399:25D4A7DB2FCE2B30]:0)
   [junit4]    >        at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:878)
   [junit4]    >        at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:845)
   [junit4]    >        at org.apache.solr.core.TestReloadConcurrentAdds.testConcurrent(TestReloadConcurrentAdds.java:103)
   [junit4]    >        at org.apache.solr.core.TestReloadConcurrentAdds.testConcurrent(TestReloadConcurrentAdds.java:49)
   [junit4]    >        at java.lang.Thread.run(Thread.java:748)
   [junit4]    > Caused by: org.apache.solr.common.SolrException: openNewSearcher called
on closed core
{noformat}

{noformat}
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestReloadConcurrentAdds -Dtests.method=testConcurrentWithNewSearcherExceptionHack
-Dtests.seed=5A728C1E8654399 -Dtests.slow=true -Dtests.locale=es-PA -Dtests.timezone=Atlantic/Faroe
-Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.81s | TestReloadConcurrentAdds.testConcurrentWithNewSearcherExceptionHack
<<<
   [junit4]    > Throwable #1: java.lang.RuntimeException: Exception during query
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([5A728C1E8654399:1C989F4D65319F55]:0)
   [junit4]    >        at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:878)
   [junit4]    >        at org.apache.solr.SolrTestCaseJ4.assertQ(SolrTestCaseJ4.java:845)
   [junit4]    >        at org.apache.solr.core.TestReloadConcurrentAdds.testConcurrent(TestReloadConcurrentAdds.java:103)
   [junit4]    >        at org.apache.solr.core.TestReloadConcurrentAdds.testConcurrentWithNewSearcherExceptionHack(TestReloadConcurrentAdds.java:55)
   [junit4]    >        at java.lang.Thread.run(Thread.java:748)
   [junit4]    > Caused by: java.lang.RuntimeException: REQUEST FAILED: xpath=//result[@numFound='33']
   [junit4]    >        xml response was: <?xml version="1.0" encoding="UTF-8"?>
   [junit4]    > <response>
   [junit4]    > <lst name="responseHeader">
   [junit4]    >   <int name="status">0</int>
   [junit4]    >   <int name="QTime">0</int>
   [junit4]    >   <lst name="params">
   [junit4]    >     <str name="q">*:*</str>
   [junit4]    >     <str name="rows">0</str>
   [junit4]    >     <str name="wt">xml</str>
   [junit4]    >   </lst>
   [junit4]    > </lst>
   [junit4]    > <result name="response" numFound="32" start="0">
{noformat}


> (at least) 2 distinct failures possible when clients attempt searches during SolrCore
reload
> --------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11035
>                 URL: https://issues.apache.org/jira/browse/SOLR-11035
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>         Attachments: log.txt, SOLR-11035.patch
>
>
> If a SolrCore is reloaded, there are (at least) 2 distinct types of failures that clients
may observe when executing updates + queries while the reload is in progress...
> * documents may appear missing during queries
> * queries may fail with "SolrException: openNewSearcher called on closed core"
> Details to follow in comment...



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message