lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Smiley <david.w.smi...@gmail.com>
Subject Re: Solr - very long core admin timeouts
Date Fri, 13 Mar 2020 14:03:45 GMT
Thanks for sharing this with us and debugging this reasonably far.

If this happens when a replica is being added, then this makes sense to me
because the replica isn't ready yet to respond with a searcher.
Conditionally if the replica is in recovery, then maybe don't ask for the
indexInfo?

I can see room for improvement in Solr here.  The core admin stats don't
actually need an IndexSearcher, it needs a DirectoryReader which is much
cheaper to get.  Even SolrCore.getIndexSize creates/closes one on the fly.
I filed an issue:
https://issues.apache.org/jira/browse/SOLR-14325

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Fri, Mar 13, 2020 at 8:50 AM Richard Goodman <richardg@brandwatch.com>
wrote:

> Hi there,
>
> I went to a Solr meetup in London a month or two ago, and asked this
> question then, however, people weren't fully sure what the underlying issue
> was and were curious to know.
>
> - Background of our tech stack
>
> Where I work, we run just under 30 solr clusters in our live environment:
>
>    - Most of our clusters have the upper limit capacity of around 100TB,
>    but typically our used capacity per cluster is around 55% on average.
>    - Most of our clusters are running 96 instances, if not more.
>    - Each instance has a TB of storage,
>    - Most of our clusters have 322 collections, where each collection is
>    split into 16 shards and 3-way replication
>    - Each core is on average 20GB, some going up to 40GB.
>    - We are currently running solr v7.7.2
>
>
> We've been dealing with an issue with running Solr Cloud to this size for
> quite some time, whilst it's never caused a catastrophic failure, when it
> comes to us wanting to improve our monitoring, it can become quite a
> blocker for us.
>
> - The issue
>
> We use the core admin API a lot, to get information on core level metrics.
> However, if there are cores recovering, this will lock and slow down the
> API massively, sometimes minutes to get a response from said API.
>
> I worked on implementing the solr prometheus exporter to provide us vision
> on our clusters, however, because of the core admin timeout, I am met with
> tons of errors within the application and unable to collect core level
> metrics. Similarly, if we use the metrics api for the node level metrics
> and core level metrics, it will take a very long time to return results.
>
> - Any findings so far
>
> We discovered that if you supply "&indexInfo=false" to the core api call,
> it will return instantly, but of course you miss some information that is
> quite useful. We have potentially identified where in a jstack the issue is
> occuring, as quoted from a colleague:
>
> "Had a look at the code from my stack trace, as we expected it's waiting
> for the index to be opened to populate the info about numDocs etc, for the
> Core admin API call"
>
> However, not entirely sure what the best approach to fixing this is? We
> were also curious if other people are getting a similar issue?
>
> - Evidence
>
> I created a small dev cluster in our tech stack, and loaded a 30GB backup,
> where each core is around 2GB big on average. I then just ran a time on a
> curl of the core admin api;
>
> *Call before a replica being added:*
>
> root @ rich-solr0.dev-cluster.net ~ # time curl -s "http://localhost:8080/solr/admin/cores"
> core_admin_pre.outreal	0m0.070s
> user	0m0.008s
> sys	0m0.008s
>
>
> *Call during a replica being added:*
>
> root @ rich-solr1.dev-cluster.net ~ # time curl -s "http://localhost:8080/solr/admin/cores"
> core_admin_during.outreal	0m8.130s
> user	0m0.016s
> sys	0m0.008s
>
>
>
> *Call after a replica being added:*
>
> root @ rich-solr1.dev-cluster.net ~ # time curl -s "http://localhost:8080/solr/admin/cores"
> core_admin_done.out
>
> real	0m0.037s
>
> user	0m0.008s
>
> sys	0m0.012s
>
>
> I know here, that the difference is only 8 seconds, however, this is only
> with 2GB cores, in our live environment, it's significantly larger.
>
> I have also collected a bunch of jstacks that are before, during and after
> this occurring, where a previous colleague quoted the following being a
> possible root cause:
>
> "qtp511717113-1006" #1006 prio=5 os_prio=0 tid=0x00007f465800a000 nid=0xc709 in Object.wait()
[0x00007f6b2f954000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:502)
>         at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2206)
>         - locked <0x00000004d1d75af0> (a java.lang.Object)
>         at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1994)
>         at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1868)
>         at org.apache.solr.handler.admin.CoreAdminOperation.getCoreStatus(CoreAdminOperation.java:339)
>         at org.apache.solr.handler.admin.StatusOp.execute(StatusOp.java:46)
>         at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:360)
>         at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:396)
>         at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:180)
>
>
> The stacktraces can be found here
> <https://github.com/richardbadman/core-admin-timeout>.
>
> If anyone has any advice, that would be greatly appreciated.
>
> Cheers,
> --
>
> Richard Goodman
>

Mime
View raw message