lucene-solr-user mailing list archives

From Joseph Obernberger <>
Subject Re: How large is your solr index?
Date Wed, 07 Jan 2015 17:50:06 GMT
Kinda late to the party on this very interesting thread, but I'm 
wondering if anyone has been using SolrCloud with HDFS at large scale?  
We really like this capability since our data is inside Hadoop: we can 
run the Solr shards on the same nodes, and we only need to manage one 
pool of storage (HDFS).  Our current setup consists of 11 systems 
(bare metal; the cluster is actually 16 nodes, but not all of them run 
Solr) with a 2.9 TByte index and just under 900 million docs spread 
across 22 shards (2 shards per physical box) in a single collection.  We 
index around 3 million docs per day.
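For a rough sense of per-shard scale, the figures above work out like this (back-of-the-envelope shell arithmetic on the numbers in this message; integer division, so the results are approximate):

```shell
# Figures from the cluster described above.
INDEX_GB=2900        # ~2.9 TBytes of total index
DOCS=900000000       # just under 900 million docs
SHARDS=22            # 22 shards, 2 per physical box

echo "$(( INDEX_GB / SHARDS )) GB per shard"      # ~131 GB
echo "$(( DOCS / SHARDS )) docs per shard"        # ~41 million
```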

A typical query takes about 7 seconds to run, but we also do faceting 
and clustering.  Those can take 3 - 5 minutes depending on what was 
queried, though they can finish in as little as 10 seconds. The index 
contains about 100 fields.
We are looking at switching to a different method of indexing our data 
which will involve a much larger number of fields, with very little 
stored in the index (index-only) to help improve performance.

I've used SPLITSHARD with success, but when using HDFS the server doing 
the split needs triple the usual amount of direct memory, because that 
node ends up running three shards (the original plus the two new 
sub-shards).  This can lead to 'swap hell' if you're not careful. On 
large indexes, the split can take a very long time to run, much longer 
than the REST timeout; it can, however, be monitored by checking 
ZooKeeper's clusterstate.json.
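For anyone trying this, both the split and its progress can be driven through the Collections API. This is only a sketch - the host, collection, and shard names below are placeholders - but running the split with `async` and polling REQUESTSTATUS sidesteps the REST-timeout problem with long-running splits:

```shell
# Placeholders - substitute your own SolrCloud host, collection, and shard.
SOLR=http://localhost:8983/solr
COLLECTION=mycollection
SHARD=shard1

# Kick off the split asynchronously so the HTTP call returns immediately.
SPLIT_URL="$SOLR/admin/collections?action=SPLITSHARD&collection=$COLLECTION&shard=$SHARD&async=split-$SHARD"
echo "curl \"$SPLIT_URL\""

# Poll for completion instead of watching clusterstate.json by hand.
STATUS_URL="$SOLR/admin/collections?action=REQUESTSTATUS&requestid=split-$SHARD"
echo "curl \"$STATUS_URL\""
```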


On 1/7/2015 4:25 AM, Bram Van Dam wrote:
> On 01/06/2015 07:54 PM, Erick Erickson wrote:
>> Have you considered pre-supposing SolrCloud and using the SPLITSHARD
>> API command?
> I think that's the direction we'll probably be going. Index size (at 
> least for us) can be unpredictable in some cases. Some clients start 
> out small and then grow exponentially, while others start big and then 
> don't grow much at all. Starting with SolrCloud would at least give us 
> that flexibility.
> That being said, SPLITSHARD doesn't seem ideal. If a shard reaches a 
> certain size, it would be better for us to simply add an extra shard, 
> without splitting.
>> On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge 
>> <> wrote:
>>> ++1 for the automagic shard creator. We've been looking into doing this
>>> sort of thing internally - i.e. when a shard reaches a certain size/number
>>> of docs, it creates 'sub-shards' to which new commits are sent, and
>>> queries to the 'parent' shard include them. The concept works, as long
>>> as you don't try any non-distributed stuff - that's one reason why all
>>> our fields are always single-valued.
> Is there a problem with multi-valued fields and distributed queries?
>>> A cool side-effect of sub-sharding (for lack of a snappy term) is 
>>> that the
>>> parent shard then stops suffering from auto-warming latency due to 
>>> commits
>>> (we do a fair amount of committing). In theory, you could carry on
>>> sub-sharding until your hardware starts gasping for air.
> Sounds like you're doing something similar to us. In some cases we 
> have a hard commit every minute. Keeping the caches hot seems like a 
> very good reason to send data to a specific shard. At least I'm 
> assuming that when you add documents to a single shard and commit, 
> the other shards won't be impacted...
>  - Bram
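Sending data to one named shard, as Bram describes, is possible with Solr's implicit (manual) document router. A sketch with made-up collection and shard names follows; note it only shows the routing - whether a commit issued this way really leaves the other shards' caches untouched is exactly the open question above:

```shell
SOLR=http://localhost:8983/solr   # placeholder host

# Create a collection whose shards are addressed explicitly by name.
CREATE_URL="$SOLR/admin/collections?action=CREATE&name=logs&router.name=implicit&shards=shard1,shard2"
echo "curl \"$CREATE_URL\""

# Index into one specific shard via the _route_ parameter.
UPDATE_URL="$SOLR/logs/update?_route_=shard2&commit=true"
echo "curl -d @docs.json -H 'Content-Type: application/json' \"$UPDATE_URL\""
```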
