lucene-solr-user mailing list archives

From Joseph Obernberger <>
Subject Re: How large is your solr index?
Date Thu, 08 Jan 2015 16:39:09 GMT

On 1/8/2015 3:16 AM, Toke Eskildsen wrote:
> On Wed, 2015-01-07 at 22:26 +0100, Joseph Obernberger wrote:
>> Thank you Toke - yes - the data is indexed throughout the day.  We are
>> handling very few searches - probably 50 a day; this is an R&D system.
> If your searches are in small bundles, you could pause the indexing flow
> while the searches are executed, for better performance.
>> Our HDFS cache, I believe, is too small at 10GBytes per shard.
> That depends a lot on your corpus, your searches and underlying storage.
> But with our current level of information, it is a really good bet:
> Having 10GB cache per 130GB (270GB?) data is not a lot with spinning
> drives.
Yes - it would be 20GBytes of cache per 270GBytes of data.
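For reference, the HDFS block cache discussed here is sized in Solr via slab count rather than a single byte value. A minimal sketch of the arithmetic, assuming the default 128 MB slab size of the HdfsDirectoryFactory block cache (the specific flag values are illustrative, not taken from this thread):

```shell
# Sketch: derive the HDFS block-cache slab count for a 20 GB per-box target.
# The 128 MB slab size is the Solr default; everything else is an assumption.
TARGET_CACHE_MB=$((20 * 1024))   # 20 GB of cache per box
SLAB_MB=128                      # default block-cache slab size
SLAB_COUNT=$((TARGET_CACHE_MB / SLAB_MB))
echo "slab count: $SLAB_COUNT"   # 160 slabs

# When using direct-memory allocation, MaxDirectMemorySize must be a bit
# larger than the cache itself:
JAVA_OPTS="-XX:MaxDirectMemorySize=21g \
-Dsolr.hdfs.blockcache.enabled=true \
-Dsolr.hdfs.blockcache.direct.memory.allocation=true \
-Dsolr.hdfs.blockcache.slab.count=$SLAB_COUNT"
```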
>> Current parameters for running each shard are:
>> JAVA_OPTS="-XX:MaxDirectMemorySize=10g -XX:+UseLargePages -XX:NewRatio=3
> [...]
>> -Xmx10752m"
> One Solr/shard? You could probably win a bit by having one Solr/machine
> instead. Anyway, it's quite a high Xmx, but I presume you have measured
> the memory needs.
We've tried a lower Xmx, but we get OOM errors during faceting of large 
datasets.  Right now we're running two JVMs per physical box (2 shards 
per box), but we're going to change that to one JVM and one shard 
per box.
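If the move to one JVM per box simply folds the two per-shard allocations into one, the settings might look roughly like the following (doubling the heap and direct-memory caps quoted above is an assumption for illustration; actual needs should be measured):

```shell
# Hypothetical single-JVM-per-box settings: roughly double the per-shard
# -Xmx (10752m -> 21504m) and MaxDirectMemorySize (10g -> 20g) from the
# JAVA_OPTS quoted earlier in the thread. Values are illustrative only.
JAVA_OPTS="-XX:MaxDirectMemorySize=20g -XX:+UseLargePages -XX:NewRatio=3 \
-Xmx21504m"
echo "$JAVA_OPTS"
```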
>> I'd love to try SSDs, but don't have the budget at present to go that
>> route.
> We find the price/performance of SSD + moderate RAM to be a much
> better deal than spinning drives + a lot of RAM, even when buying
> enterprise hardware. With consumer SSDs (used in our large server) it
> is cheaper still. It all depends on the use pattern of course, but
> your setup with non-concurrent searches seems like it would fit well.
> Note: I am sure that the RAM == index size would deliver very high
> performance. With enough RAM you can use tape to hold the index. Whether
> it is cost effective is another matter.
Ha!  Yes - our index is accessible via a 2400 baud modem, but we have 
lots of cache!  ;)
>> I'd really like to get the HDFS option to work well as it
>> reduces system complexity.
> That is very understandable. We examined the option of networked storage
> (Isilon) with underlying spindles, and it performed adequately for our
> needs up to 2-3TB of index data. Unfortunately the heavy random read
> load from Solr meant a noticeable degradation of other services using
> the networked storage. I am sure it could be solved with more
> centralized hardware, but in the end we found it cheaper and simpler to
> use local storage for search. This will of course differ across
> organizations and setups.

We're going to experiment with one shard per box and more RAM cache 
per shard and see where that gets us; we'll also be adding more shards.
Thanks for the tips!
Interesting that you mention Isilon, as we're planning to do an eval 
of their product this year where we'll be testing out their HDFS 
layer.  It's a potential way to balance compute and storage, since you 
can add HDFS storage without adding compute.

> - Toke Eskildsen
