lucene-solr-user mailing list archives

From Joseph Obernberger <j...@lovehorsepower.com>
Subject Re: How large is your solr index?
Date Wed, 07 Jan 2015 21:26:30 GMT
Thank you Toke - yes - the data is indexed throughout the day.  We are 
handling very few searches - probably 50 a day; this is an R&D system.
Our HDFS cache, I believe, is too small at 10 GB per shard.  This 
comes out to 20 GB of HDFS cache per physical machine, plus about 10 GB 
each for the 2 JVMs running the shards.  Each of those machines is also 
running other services, which leaves very little RAM available for FS cache.
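
The per-machine arithmetic above can be sketched as follows. This is only a rough illustration: it assumes the HDFS cache is allocated off-heap (so it adds to the JVM heap rather than living inside it), and the function name and dictionary keys are made up for this example.

```python
# Rough per-machine memory budget using the figures from this message.
# Assumption: the HDFS cache is off-heap, so cache and heap add up.
GIB = 1024 ** 3
MIB = 1024 ** 2


def per_machine_budget(shards_per_box, cache_bytes_per_shard, heap_bytes_per_jvm):
    """Return the memory claimed by Solr on one box, in bytes."""
    cache = shards_per_box * cache_bytes_per_shard
    heap = shards_per_box * heap_bytes_per_jvm
    return {"hdfs_cache": cache, "jvm_heap": heap, "total": cache + heap}


# 2 shards per box, 10 GiB cache per shard, -Xmx10752m heap per JVM.
budget = per_machine_budget(2, 10 * GIB, 10752 * MIB)
for name, size in budget.items():
    print(f"{name}: {size / GIB:.1f} GiB")
```

Whatever physical RAM is left after this total, minus the other services on the box, is all the OS has for its own file system cache.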

Current parameters for running each shard are:
JAVA_OPTS="-XX:MaxDirectMemorySize=10g -XX:+UseLargePages -XX:NewRatio=3 
-XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 
-XX:MaxTenuringThreshold=8 -XX:+UseConcMarkSweepGC 
-XX:+CMSScavengeBeforeRemark -XX:PretenureSizeThreshold=64m 
-XX:CMSFullGCsBeforeCompaction=1 -XX:+UseCMSInitiatingOccupancyOnly 
-XX:CMSInitiatingOccupancyFraction=70 -XX:CMSTriggerPermRatio=80 
-XX:CMSMaxAbortablePrecleanTime=6000 -XX:+CMSParallelRemarkEnabled 
-XX:+ParallelRefProcEnabled -XX:+AggressiveOpts -XX:ParallelGCThreads=7 
-Xmx10752m"
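
For anyone comparing settings across shards, here is a minimal sketch that pulls the two memory limits out of a JAVA_OPTS string like the one above. The helper names are hypothetical, and the size parser is simplified: it only handles plain byte counts and the k/m/g suffixes.

```python
import re

# Multipliers for the size suffixes handled by this simplified parser.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text):
    """Convert a JVM-style size such as '10g' or '10752m' to bytes."""
    match = re.fullmatch(r"(\d+)([kKmMgG]?)", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit.lower()]


def memory_limits(java_opts):
    """Extract max heap and max direct memory (in bytes) from JAVA_OPTS."""
    limits = {}
    for token in java_opts.split():
        if token.startswith("-Xmx"):
            limits["heap"] = parse_size(token[len("-Xmx"):])
        elif token.startswith("-XX:MaxDirectMemorySize="):
            limits["direct"] = parse_size(token.split("=", 1)[1])
    return limits


# Shortened version of the options above; other flags are ignored.
opts = "-XX:MaxDirectMemorySize=10g -XX:+UseLargePages -Xmx10752m"
limits = memory_limits(opts)
print(limits)
```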

I'd love to try SSDs, but don't have the budget at present to go that 
route.  I'd really like to get the HDFS option to work well as it 
reduces system complexity.  It seems to me that if our HDFS cluster has 
lots/enough spindles, performance should be relatively good, as long as 
the OS can actually do some caching.  We will be adding more HDFS nodes 
in the future, increasing spindle count and reducing the amount of data 
stored into Solr.  When we redo our Solr Cloud, we will only run one 
shard per box, and supply more HDFS cache.

-Joe

On 1/7/2015 3:50 PM, Toke Eskildsen wrote:
> Joseph Obernberger [joeo@lovehorsepower.com] wrote:
>
> [HDFS, 9M docs, 2.9TB, 22 shards, 11 bare metal boxes]
>
>> A typical query takes about 7 seconds to run, but we also do faceting
>> and clustering.  Those can be in the 3 - 5 minute range, depending on
>> what was queried, but can be as little as 10 seconds. The index contains
>> about 100 fields.
> 7 seconds without faceting seems like a long time. I am guessing your 3M
> daily updates are spread throughout the day, instead of being a nightly
> batch job? How many concurrent searches are you handling?
>
> We have no experience with HDFS for Solr indexes, but a quick check
> indicates that it is not a good fit for Solr. At least not out of the box:
> http://hbase.apache.org/book.html#perf.hdfs.curr
>
> We did at one point try to use networked storage for our index. That meant
> 1/3 performance, compared to local storage, but of course your mileage will
> vary. As you are looking into ways of improving performance, what about
> testing the performance difference with local storage (SSD of course)?
>
> - Toke Eskildsen
>

