lucene-solr-user mailing list archives

From Stephen Weiss <>
Subject SOLR Memory Usage - Where does it go?
Date Fri, 23 Jul 2010 23:37:38 GMT
We have been having problems with SOLR on one project lately.  Forgive  
me for writing a novel here, but it's really important that we identify  
the root cause of this issue.  SOLR is becoming unavailable at random  
intervals, and the problem appears to be memory related.  There are  
basically two ways it goes:

1) Straight up OOM error, either from Java or sometimes from the  
kernel itself.

2) Instead of throwing an OOM, the memory usage gets very high and  
then drops precipitously (say, from 92% (of 20GB) down to 60%).  Once  
the memory usage is done dropping, SOLR seems to stop responding to  
requests altogether.

It started out mostly being version #1 of the problem but now we're  
mostly seeing version #2 of the problem... and it's getting more and  
more frequent.  In either scenario the servlet container (Jetty) needs  
to be restarted to resume service.
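Scenario #2 (heap climbing toward the max, then a sharp drop while requests stop being answered) looks a lot like the JVM spending nearly all its time in back-to-back full GCs rather than genuinely freeing memory for reuse.  One cheap way to check, without attaching a profiler, is to poll the heap from inside the JVM via the standard MemoryMXBean - this is a generic JMX sketch, nothing SOLR-specific:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapProbe {
    public static void main(String[] args) throws InterruptedException {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        for (int i = 0; i < 3; i++) {
            MemoryUsage heap = mem.getHeapMemoryUsage();
            // "used" dropping sharply while the process stops answering
            // requests usually means consecutive full GCs, not a real
            // release of memory back to the application.
            System.out.println("heap used: " + heap.getUsed() / (1024 * 1024)
                    + " MB / max: " + heap.getMax() / (1024 * 1024) + " MB");
            Thread.sleep(1000);
        }
    }
}
```

If the used figure is pinned near max and only dips briefly after long pauses, the GC is the bottleneck; verbose GC logging (-verbose:gc) would confirm the same thing from the outside.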

The number of documents in the index is always going up.  They are  
relatively small in size (1K per piece max - mostly small numeric  
strings, with 5 text fields (one each for 5 languages) that are rarely  
more than 50-100 characters), and there are about 5 million of them at  
the moment (adding around 1000 every day).  The machine has 20 GB of  
RAM, Xmx is set to 18GB, and SOLR is the only thing this machine /  
servlet container does.  There are a couple of other cores configured,  
but they are minuscule in comparison (one with 200,000 docs, and two  
more with < 10,000 docs apiece).  Eliminating these other cores does  
not seem to make any significant impact.  This is with the SOLR 1.4.1  
release, using the SOLR-236 patch that was recently released to go  
with this version.  The patch was slightly modified in order to ensure  
that paging continued to work properly  - basically, an optimization  
that eliminated paging was removed per the instructions in this comment:


I realize this is not ideal if you want to control memory usage, but  
the design requirements of the project preclude us from eliminating  
either collapsing or paging.  It's also probably worth noting that  
these problems did not start with version 1.4.1 or this version of the  
236 patch - we actually upgraded from 1.4 because they said it fixed  
some memory leaks, hoping it would help solve this problem.

We have some test machines set up and we have been testing out various  
configuration changes.  Watching the stats in the admin area, this is  
what we've been able to figure out:

1) The fieldValueCache usage stays constant at 23 entries (one for  
each faceted field), and takes up a total size of about 750 MB.  

2) Lowering or just eliminating the filterCache and the  
queryResultCache does not seem to have any serious impact - perhaps a  
difference of a few percent at the start, but after prolonged usage  
the memory still goes up seemingly uncontrolled.  It would appear the  
queryResultCache does not get much usage anyway, and even though we  
have higher eviction rates in the filterCache, this really doesn't  
seem to impact performance significantly.

3) Lowering or eliminating the documentCache also doesn't seem to have  
very much impact on memory usage, although it does make searches much slower.

4) We followed the instructions for configuring the HashDocSet  
parameter, but this doesn't seem to be having much impact either.

5)  All the caches, with the exception of the documentCache, are  
FastLRUCaches.  Switching between FastLRUCache and normal LRUCache in  
general doesn't seem to change the memory usage.

6) Glancing through all of the data on memory usage in the Lucene  
fieldCache would indicate that this cache is using well under 1GB of  
RAM as well.
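For what it's worth, the ~750 MB figure for 23 faceted fields over ~5 million docs is at least in a plausible ballpark for an uninverted faceting structure, so the fieldValueCache number itself doesn't look suspicious.  A back-of-envelope check (the 7 bytes per doc per field below is purely an assumed factor for illustration - the real cost depends heavily on the number of unique terms per field):

```java
// Rough sizing sketch for the fieldValueCache faceting structures.
public class FacetCacheEstimate {
    public static void main(String[] args) {
        long numDocs = 5000000L;           // roughly the index size described above
        int facetedFields = 23;            // one fieldValueCache entry per faceted field
        double bytesPerDocPerField = 7.0;  // assumption, not a measured constant
        double totalMb = numDocs * facetedFields * bytesPerDocPerField / (1024 * 1024);
        System.out.printf("estimated facet cache footprint: ~%.0f MB%n", totalMb);
        // lands near the ~750 MB observed in the admin stats
    }
}
```

Since that structure is rebuilt per searcher, two searchers alive at once (e.g. during warming after a commit) would briefly double it, but that alone can't explain growth to 90% of an 18 GB heap.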

Basically, when the servlet first starts, it uses very little RAM  
(<4%).  We warm the searcher with a few standard queries that  
initialize everything in the fieldValueCache off the bat, and the  
query performance levels off at a reasonable speed, with memory usage  
around 10-12%.  At this point, almost all queries execute within a few  
hundred milliseconds, if not faster.  A very few queries that return large numbers of  
collapsed documents, generally 800K up to about 2 million (we have  
about 5 distinct queries that do this), will take up to 20 seconds to  
run the first time, and up to 10 seconds thereafter.  Even after  
running all these queries, memory usage stays around 20-30%.  At this  
point, performance is optimal.  We simulate production usage, running  
queries taken from those logs through the system at a rate similar to  
production use.

For the most part, memory usage stays level.  Usage will go up as  
queries are run (this seems to correspond with when they are being  
collapsed), but then go back down as the results are returned.  Then,  
over the course of a few hours, at seemingly random intervals, memory  
usage will go up and stay up, plateauing at some new level.   
Performance doesn't change really at this point - it's still the same  
speed it was before.  SOLR is simply using more memory than it was  
before, but not really doing anything more than it was before either.   
If we look at the stats on the caches, the caches do not seem to be  
any larger than they were after it was freshly warmed.  Eventually,  
RAM usage hits 40%, then 50% an hour or two later, until after about  
8-12 hours it tops out around 90%.

One guess was that SOLR is starting more threads to handle more  
requests - however, this isn't borne out in the process list - when I  
check, the number of threads is level at 33 threads, and all the  
threads have the same start time.   I'm not intimately familiar with  
how Jetty works with threading, but it also seems that all the threads  
share the same caches - otherwise, one would expect to see stats on  
the different caches in the statistics page (or at least see them  
change, depending on what thread one was using).
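For anyone wanting to confirm the thread picture from inside the JVM rather than from the OS process list, the standard ThreadMXBean reports live and peak counts (again a generic JMX sketch, nothing Jetty-specific):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadProbe {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // A flat live count (e.g. pinned at 33) while heap keeps growing
        // rules out runaway thread creation as the memory consumer.
        System.out.println("live threads: " + threads.getThreadCount()
                + ", peak: " + threads.getPeakThreadCount());
    }
}
```

And yes, the caches are shared per core, not per thread - the admin statistics page reports a single set of cache stats per core regardless of how many request threads Jetty runs.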

If SOLR's not using this RAM for caching, and it's not using it for  
new documents (we've completely eliminated commits from the equation -  
in fact, this seems to happen more when there are fewer commits, which  
unfortunately means overnight) - what is this RAM going towards?  It  
doesn't make any sense to me that it's answering the same queries at  
the same speed, but 4 hours later it needs twice as much memory to do  
the same thing.  If this is a problem with the collapse patch, what is  
it doing that it needs to leave such high volumes of data in memory,  
even long after it's done doing its work?  If it's not the collapse  
patch, then what could it be?  Unfortunately, it's really hard to tell  
how much RAM most of the caches are using because this information is  
not uniformly displayed on the statistics page - we know how many  
entries there are, but we don't know how big the entries are or if the  
size of the entries changes over time.  But in any event, after we  
turn all caching off, it still seems to happen anyway, so at this  
point it seems safe to say that the excess RAM is not being used for  
cache.

At the moment, I feel like we've tweaked everything we can think of in  
the solrconfig.xml with little change in how it operates.  I'm going  
to go look now and see if perhaps this might be an issue with the  
servlet container itself - this is Jetty 6.1.12, we're a little  
behind.  But if anyone has any ideas as to where else this memory  
could be going, and what practical steps we could take to at least keep  
the server from OOMing, any information would be helpful.


