lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Memory leak in Solr
Date Sun, 04 Dec 2016 16:59:40 GMT
All of this is consistent with not having a properly
tuned Solr instance wrt # documents, usage
pattern, memory allocated to the JVM, GC
settings and the like.

Your leader issues can be explained by long
GC pauses too. Zookeeper periodically pings
each replica it knows about and if the response
times out (due to GC in this case) then Zookeeper
thinks the node has gone away and marks
it as "down". Similarly when a leader forwards
an update to a follower and the request times
out, the leader will mark the follower as down.
Do this enough and the state of the cluster gets
"interesting".
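
To make the timeout mechanism concrete, here is a minimal sketch that flags GC pauses long enough to outlast a timeout. The pause values and the 15-second timeout are hypothetical placeholders, not numbers from this thread; substitute the pauses from your own gc.log and whatever timeout your cluster is actually configured with.

```python
def pauses_exceeding_timeout(pause_seconds, timeout_seconds):
    """Return the pauses that last at least as long as the timeout."""
    return [p for p in pause_seconds if p >= timeout_seconds]


# Hypothetical stop-the-world pause durations, in seconds.
pauses = [0.4, 0.6, 17.2, 0.5, 31.0]

# Hypothetical timeout; use the value configured for your cluster.
risky = pauses_exceeding_timeout(pauses, timeout_seconds=15.0)
print(risky)
```

Any pause that shows up in that list is long enough, on its own, for a node to be marked as down.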

You still haven't told us what version of Solr
you're using. The "Version" you took from
the core stats is the version of the _index_,
not of Solr.

You have almost 200M documents on
a single core. That's definitely on the high side,
although I've seen that work, assuming
you aren't doing things like faceting and
sorting on non-docValues fields.

As others have pointed out, the link you
provided doesn't provide much in the way of
any "smoking guns" as far as a memory
leak is concerned.

I've certainly seen situations where the memory
required by Solr is close to the total memory
allocated to the JVM, for instance. Then the GC
cycle kicks in and recovers just enough to
go on for a very brief time before going into another
GC cycle, resulting in very poor performance.
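
One rough way to quantify this is the fraction of wall-clock time spent paused in GC. The sketch below assumes you have already extracted pause durations from gc.log for some time window; the function name and the sample values are illustrative only.

```python
def gc_overhead(pause_seconds, window_seconds):
    """Fraction of the window during which application threads were paused."""
    return sum(pause_seconds) / window_seconds


# Illustrative input: one 0.5 s pause in every second of a 60 s window,
# which matches the "0.5 seconds per second" described in the quoted message.
overhead = gc_overhead([0.5] * 60, window_seconds=60.0)
print(f"{overhead:.0%} of the window was spent in GC pauses")
```

A JVM that spends anything like half its time paused is a JVM with too little heap headroom for the load it is carrying.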

So overall this looks like you need to do some
serious tuning of your Solr instances and take a
hard look at how you're using your physical
machines. You specify that these are VMs,
but how many VMs are you running per box?
How much JVM heap have you allocated for each?
How much total physical memory do you have
to work with per box?
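
The answers to those questions feed a simple per-box budget, sketched below. All figures are made up for illustration, and the calculation ignores the memory each VM's own OS needs, so treat the result as an upper bound on what is left over for everything other than the JVM heaps.

```python
def memory_left_after_heaps(physical_gb, heaps_gb):
    """Physical memory on the box minus the sum of all JVM heaps on it."""
    return physical_gb - sum(heaps_gb)


# Hypothetical box: 64 GB of RAM hosting 4 VMs, each with a 12 GB Solr heap.
left = memory_left_after_heaps(64, [12, 12, 12, 12])
print(f"{left} GB left for the OS, disk cache and everything else")
```

If that number is small or negative, the box is oversubscribed regardless of how each individual JVM is tuned.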

Even if you provide the answers to the above
questions, there's not much we can do to
help you resolve your issues if the cause is
simply inappropriate sizing. I'd really recommend
you create a stress environment so you can
test different scenarios and become confident about
your expected performance. Here's a blog post on the
subject:

https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick

On Sat, Dec 3, 2016 at 8:46 PM, S G <sg.online.email@gmail.com> wrote:
> The symptom we see is that the java clients querying Solr see response
> times in 10s of seconds (not milliseconds).
> And on the tomcat's gc.log file (where Solr is running), we see very bad GC
> pauses - threads being paused for 0.5 seconds per second approximately.
>
> Some numbers for the Solr Cloud:
>
> *Overall infrastructure:*
> - Only one collection
> - 16 VMs used
> - 8 shards (1 leader and 1 replica per shard - each core on separate VM)
>
> *Overview from one core:*
> - Num Docs:193,623,388
> - Max Doc:230,577,696
> - Heap Memory Usage:231,217,880
> - Deleted Docs:36,954,308
> - Version:2,357,757
> - Segment Count:37
>
> *Stats from QueryHandler/select*
> - requests:78,557
> - errors:358
> - timeouts:0
> - totalTime:1,639,975.27
> - avgRequestsPerSecond:2.62
> - 5minRateReqsPerSecond:1.39
> - 15minRateReqsPerSecond:1.64
> - avgTimePerRequest:20.87
> - medianRequestTime:0.70
> - 75thPcRequestTime:1.11
> - 95thPcRequestTime:191.76
>
> *Stats from QueryHandler/update*
> - requests:33,555
> - errors:0
> - timeouts:0
> - totalTime:227,870.58
> - avgRequestsPerSecond:1.12
> - 5minRateReqsPerSecond:1.16
> - 15minRateReqsPerSecond:1.23
> - avgTimePerRequest:6.79
> - medianRequestTime:3.16
> - 75thPcRequestTime:5.27
> - 95thPcRequestTime:9.33
>
> And yet the Solr clients are reporting timeouts and very long read times.
>
> Plus, on every server, we are seeing lots of exceptions.
> For example:
>
> Between 8:06:55 PM and 8:21:36 PM, exceptions are:
>
> 1) Request says it is coming from leader, but we are the leader:
> update.distrib=FROMLEADER&distrib.from=HOSTB_ca_1_1456430020/&wt=javabin&version=2
>
> 2) org.apache.solr.common.SolrException: Request says it is coming from
> leader, but we are the leader
>
> 3) org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> 4) null:org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> 5) org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> 6) null:org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> 7) org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: No live SolrServers
> available to handle this request. Zombie server list:
> [HOSTA_ca_1_1456429897]
>
> 8) null:org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: No live SolrServers
> available to handle this request. Zombie server list:
> [HOSTA_ca_1_1456429897]
>
> 9) org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> 10) null:org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> 11) org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> 12) null:org.apache.solr.common.SolrException:
> org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> operation and it timed out, so failing fast
>
> Why are we seeing so many timeouts then and why so huge response times on
> the client?
>
> Thanks
> SG
>
>
>
> On Sat, Dec 3, 2016 at 4:19 PM, <billnbell@gmail.com> wrote:
>
>> What tool is that ? The stats I would like to run on my Solr instance
>>
>> Bill Bell
>> Sent from mobile
>>
>>
>> > On Dec 2, 2016, at 4:49 PM, Shawn Heisey <apache@elyograg.org> wrote:
>> >
>> >> On 12/2/2016 12:01 PM, S G wrote:
>> >> This post shows some stats on Solr which indicate that there might be a
>> >> memory leak in there.
>> >>
>> >> http://stackoverflow.com/questions/40939166/is-this-a-memory-leak-in-solr
>> >>
>> >> Can someone please help to debug this?
>> >> It might be a very good step in making Solr stable if we can fix this.
>> >
>> > +1 to what Walter said.
>> >
>> > I replied earlier on the stackoverflow question.
>> >
>> > FYI -- your 95th percentile request time of about 16 milliseconds is NOT
>> > something that I would characterize as "very high."  I would *love* to
>> > have statistics that good.
>> >
>> > Even your 99th percentile request time is not much more than a full
>> > second.  If a search takes a couple of seconds, most users will not
>> > really care, and some might not even notice.  It's when a large
>> > percentage of queries start taking several seconds that complaints start
>> > coming in.  On your system, 99 percent of your queries are completing in
>> > 1.3 seconds or less, and 95 percent of them are less than 17
>> > milliseconds.  That sounds quite good to me.
>> >
>> > In my experience, the time it takes for the browser to receive the
>> > search result page and render it is a significant part of the total time
>> > to see results, and often dwarfs the time spent getting info from Solr.
>> >
>> > Here's some numbers from Solr in my organization:
>> >
>> > requests:               4102054
>> > errors:                 364894
>> > timeouts:               49
>> > totalTime:              799446287.45041
>> > avgRequestsPerSecond:   1.2375565828793849
>> > 5minRateReqsPerSecond:  0.8444329508327961
>> > 15minRateReqsPerSecond: 0.8631197328073346
>> > avgTimePerRequest:      194.88926460997587
>> > medianRequestTime:      20.8566605
>> > 75thPcRequestTime:      85.51328849999999
>> > 95thPcRequestTime:      2202.277466549999
>> > 99thPcRequestTime:      5280.375381280002
>> > 999thPcRequestTime:     6866.020122961001
>> >
>> > The numbers above come from a distributed index that contains 167
>> > million documents and takes up about 200GB of disk space across two
>> > machines.
>> >
>> > requests:               192683
>> > errors:                 124
>> > timeouts:               0
>> > totalTime:              199380421.985073
>> > avgRequestsPerSecond:   0.042222722771354554
>> > 5minRateReqsPerSecond:  0.00800545427600684
>> > 15minRateReqsPerSecond: 0.017521222412364163
>> > avgTimePerRequest:      1034.7587591280653
>> > medianRequestTime:      541.591858
>> > 75thPcRequestTime:      1683.83246125
>> > 95thPcRequestTime:      5644.542019949997
>> > 99thPcRequestTime:      9445.592394760004
>> > 999thPcRequestTime:     14602.166640771007
>> >
>> > These numbers are from an index with about 394 million documents, taking
>> > up nearly 500GB of disk space.  This index is also distributed on
>> > multiple machines.
>> >
>> > Are you experiencing any problems other than what you perceive as slow
>> > queries?  I asked some other questions on stackoverflow.  In particular,
>> > I'd like to know the total memory on the server, the total number of
>> > documents (maxDoc and numDoc) you're handling with this server, as well
>> > as the total index size.  What do your queries look like?  What version
>> > and vendor of Java are you using?  Can you share your config/schema?
>> >
>> > A memory leak is very unlikely, unless your Java or your operating
>> > system is broken.  I can't say for sure that it's not happening, but
>> > it's just not something we see around here.
>> >
>> > Here's what I have collected on performance issues in Solr.  This page
>> > does mostly concern itself with memory, though it touches briefly on
>> > other topics:
>> >
>> > https://wiki.apache.org/solr/SolrPerformanceProblems
>> >
>> > Thanks,
>> > Shawn
>> >
>>
