lucene-solr-user mailing list archives

From Toke Eskildsen ...@statsbiblioteket.dk>
Subject Re: How large is your solr index?
Date Thu, 08 Jan 2015 08:16:12 GMT
On Wed, 2015-01-07 at 22:26 +0100, Joseph Obernberger wrote:
> Thank you Toke - yes - the data is indexed throughout the day.  We are 
> handling very few searches - probably 50 a day; this is an R&D system.

If your searches are in small bundles, you could pause the indexing flow
while the searches are executed, for better performance. 
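One way to sketch the "pause indexing while searching" idea is a gate
that a background indexing loop checks before sending each batch. This
is only an illustration: IndexingGate, search_bundle and
wait_until_indexing_allowed are hypothetical names, and the gate does
not talk to Solr itself. It only coordinates your own indexing and
search client code, assuming both run in the same process.

```python
import threading
from contextlib import contextmanager


class IndexingGate:
    """Lets search bundles temporarily hold back a background indexing loop."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active_bundles = 0
        self._may_index = threading.Event()
        self._may_index.set()  # indexing is allowed until a bundle starts

    @contextmanager
    def search_bundle(self):
        """Hold back indexing for as long as the with-block is running."""
        with self._lock:
            self._active_bundles += 1
            self._may_index.clear()
        try:
            yield
        finally:
            with self._lock:
                self._active_bundles -= 1
                # Resume only when the last overlapping bundle has finished.
                if self._active_bundles == 0:
                    self._may_index.set()

    def wait_until_indexing_allowed(self, timeout=None):
        """Block until no bundle is active; returns False on timeout."""
        return self._may_index.wait(timeout)
```

The indexing loop would call gate.wait_until_indexing_allowed() before
each batch, and the search client would wrap each bundle of queries in
"with gate.search_bundle():". A batch that is already in flight when a
bundle starts is not interrupted.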

> Our HDFS cache, I believe, is too small at 10GBytes per shard.

That depends a lot on your corpus, your searches and underlying storage.
But with what we know so far, it is a really good bet that the cache is
too small: 10GB of cache per 130GB (270GB?) of data is not a lot with
spinning drives.
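To put a number on that, here is a small sketch of the arithmetic. The
function name cache_fraction is made up for illustration; the 10GB,
130GB and 270GB figures are the ones from this thread.

```python
def cache_fraction(cache_gb, index_gb):
    """Return the share of the index that fits in the cache (0.0 to 1.0)."""
    if index_gb <= 0:
        raise ValueError("index_gb must be positive")
    return min(cache_gb / index_gb, 1.0)


# Figures from the thread: 10GB cache per shard, 130GB or 270GB of data.
for index_gb in (130, 270):
    print(f"{index_gb}GB per shard: {cache_fraction(10, index_gb):.1%} cached")
```

That is roughly 8% of the data cached in the 130GB case and roughly 4%
in the 270GB case, so most random reads would still go to the drives.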

> Current parameters for running each shard are:
> JAVA_OPTS="-XX:MaxDirectMemorySize=10g -XX:+UseLargePages -XX:NewRatio=3 
[...]
> -Xmx10752m"

One Solr/shard? You could probably win a bit by having one Solr/machine
instead. Anyway, it's quite a high Xmx, but I presume you have measured
the memory needs.
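As a rough sketch of what those two settings add up to per JVM: the
helper names below are hypothetical, and the result is only a lower
bound, since a JVM also uses memory outside the heap and the direct
buffers (thread stacks, class metadata and so on). The number of Solr
instances per machine is not stated in this thread, so it is a
parameter here.

```python
def parse_size_mib(value):
    """Convert a JVM size string such as '10g' or '10752m' to MiB."""
    units = {"k": 1 / 1024, "m": 1, "g": 1024}
    number, unit = value[:-1], value[-1].lower()
    if unit not in units:
        raise ValueError(f"unsupported size suffix in {value!r}")
    return float(number) * units[unit]


def jvm_budget_gib(xmx, max_direct, jvms=1):
    """Heap plus direct memory in GiB for `jvms` identical JVMs."""
    per_jvm_mib = parse_size_mib(xmx) + parse_size_mib(max_direct)
    return jvms * per_jvm_mib / 1024


# -Xmx10752m and -XX:MaxDirectMemorySize=10g, as quoted above.
print(jvm_budget_gib("10752m", "10g"))
```

With the quoted values this comes to 20.5 GiB per shard, which then has
to be multiplied by the number of shards running on each machine.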

> I'd love to try SSDs, but don't have the budget at present to go that 
> route.

We find the price/performance of SSD + moderate RAM to be quite a bit
better than that of spinning drives + a lot of RAM, even when buying
enterprise hardware. With consumer SSDs (used in our large server) it is
cheaper still. It all depends on the usage pattern of course, but your
setup with non-concurrent searches seems like it would fit well.

Note: I am sure that RAM == index size would deliver very high
performance. With enough RAM you could even use tape to hold the index.
Whether it is cost effective is another matter.

> I'd really like to get the HDFS option to work well as it 
> reduces system complexity.

That is very understandable. We examined the option of networked storage
(Isilon) with underlying spindles, and it performed adequately for our
needs up to 2-3TB of index data. Unfortunately the heavy random read
load from Solr meant a noticeable degradation of other services using
the networked storage. I am sure it could be solved with more
centralized hardware, but in the end we found it cheaper and simpler to
use local storage for search. This will of course differ across
organizations and setups.

- Toke Eskildsen


