lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Richard Goodman <richa...@brandwatch.com>
Subject Solr - very long core admin timeouts
Date Fri, 13 Mar 2020 12:50:33 GMT
Hi there,

I went to a Solr meetup in London a month or two ago, and asked this
question then, however, people weren't fully sure what the underlying issue
was and were curious to know.

- Background of our tech stack

Where I work, we run just under 30 solr clusters in our live environment:

   - Most of our clusters have the upper limit capacity of around 100TB,
   but typically our used capacity per cluster is around 55% on average.
   - Most of our clusters are running 96 instances, if not more.
   - Each instance has a TB of storage,
   - Most of our clusters have 322 collections, where each collection is
   split into 16 shards and 3-way replication
   - Each core is on average 20GB, some going up to 40GB.
   - We are currently running solr v7.7.2


We've been dealing with an issue with running Solr Cloud to this size for
quite some time, whilst it's never caused a catastrophic failure, when it
comes to us wanting to improve our monitoring, it can become quite a
blocker for us.

- The issue

We use the core admin API a lot, to get information on core level metrics.
However, if there are cores recovering, this will lock and slow down the
API massively, sometimes minutes to get a response from said API.

I worked on implementing the solr prometheus exporter to provide us vision
on our clusters, however, because of the core admin timeout, I am met with
tons of errors within the application and unable to collect core level
metrics. Similarly, if we use the metrics api for the node level metrics
and core level metrics, it will take a very long time to return results.

- Any findings so far

We discovered that if you supply "&indexInfo=false" to the core api call,
it will return instantly, but of course you miss some information that is
quite useful. We have potentially identified where in a jstack the issue is
occuring, as quoted from a colleague:

"Had a look at the code from my stack trace, as we expected it's waiting
for the index to be opened to populate the info about numDocs etc, for the
Core admin API call"

However, not entirely sure what the best approach to fixing this is? We
were also curious if other people are getting a similar issue?

- Evidence

I created a small dev cluster in our tech stack, and loaded a 30GB backup,
where each core is around 2GB big on average. I then just ran a time on a
curl of the core admin api;

*Call before a replica being added:*

root @ rich-solr0.dev-cluster.net ~ # time curl -s
"http://localhost:8080/solr/admin/cores" >
core_admin_pre.outreal	0m0.070s
user	0m0.008s
sys	0m0.008s


*Call during a replica being added:*

root @ rich-solr1.dev-cluster.net ~ # time curl -s
"http://localhost:8080/solr/admin/cores" >
core_admin_during.outreal	0m8.130s
user	0m0.016s
sys	0m0.008s



*Call after a replica being added:*

root @ rich-solr1.dev-cluster.net ~ # time curl -s
"http://localhost:8080/solr/admin/cores" > core_admin_done.out

real	0m0.037s

user	0m0.008s

sys	0m0.012s


I know here, that the difference is only 8 seconds, however, this is only
with 2GB cores, in our live environment, it's significantly larger.

I have also collected a bunch of jstacks that are before, during and after
this occurring, where a previous colleague quoted the following being a
possible root cause:

"qtp511717113-1006" #1006 prio=5 os_prio=0 tid=0x00007f465800a000
nid=0xc709 in Object.wait() [0x00007f6b2f954000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2206)
        - locked <0x00000004d1d75af0> (a java.lang.Object)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1994)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1868)
        at org.apache.solr.handler.admin.CoreAdminOperation.getCoreStatus(CoreAdminOperation.java:339)
        at org.apache.solr.handler.admin.StatusOp.execute(StatusOp.java:46)
        at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:360)
        at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:396)
        at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:180)


The stacktraces can be found here
<https://github.com/richardbadman/core-admin-timeout>.

If anyone has any advice, that would be greatly appreciated.

Cheers,
-- 

Richard Goodman

Mime
View raw message