lucene-solr-user mailing list archives

From Erick Erickson <>
Subject Re: How large is your solr index?
Date Wed, 07 Jan 2015 16:42:01 GMT
See below:

On Wed, Jan 7, 2015 at 1:25 AM, Bram Van Dam <> wrote:
> On 01/06/2015 07:54 PM, Erick Erickson wrote:
>> Have you considered pre-supposing SolrCloud and using the SPLITSHARD
>> API command?
> I think that's the direction we'll probably be going. Index size (at least
> for us) can be unpredictable in some cases. Some clients start out small and
> then grow exponentially, while others start big and then don't grow much at
> all. Starting with SolrCloud would at least give us that flexibility.
> That being said, SPLITSHARD doesn't seem ideal. If a shard reaches a certain
> size, it would be better for us to simply add an extra shard, without
> splitting.

True, and you can do this if you take explicit control of the document
routing, but that's quite tricky: forever after, you have to send any
_updates_ to the same shard you sent the document to the first time,
whereas SPLITSHARD will "do the right thing".
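To see why explicit routing locks you in, here's a minimal sketch. This is not Solr's actual compositeId router (which uses MurmurHash3 over a hash range, not MD5 modulo); the shard names and hash function are illustrative. The point is that once you choose a deterministic doc-to-shard mapping, it can never change, or updates will land on a different shard than the original document and you'll get duplicates.

```python
import hashlib

SHARDS = ["shard1", "shard2", "shard3"]  # hypothetical shard names

def route(doc_id: str) -> str:
    """Deterministically map a document ID to a shard.

    Illustrative only: Solr's compositeId router uses MurmurHash3
    over a hash range, not MD5 modulo. What matters is that this
    mapping must stay stable forever once documents are indexed.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same ID always routes to the same shard, so a later update
# overwrites the original document instead of duplicating it.
first = route("doc-42")
again = route("doc-42")
assert first == again
```

Note that adding a fourth entry to `SHARDS` silently changes the mapping for most existing IDs, which is exactly why "just add a shard" is harder with explicit routing than it sounds.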

>> On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge <>
>> wrote:
>>> ++1 for the automagic shard creator. We've been looking into doing this
>>> sort of thing internally - i.e. when a shard reaches a certain size/num
>>> docs, it creates 'sub-shards' to which new commits are sent and queries to
>>> the 'parent' shard are included. The concept works, as long as you don't
>>> try any non-dist stuff - it's one reason why all our fields are always
>>> single valued.
> Is there a problem with multi-valued fields and distributed queries?

No. But there are some components that don't do the right thing in
distributed mode, joins for instance. The list is actually quite small
and is getting smaller all the time.

>>> A cool side-effect of sub-sharding (for lack of a snappy term) is that the
>>> parent shard then stops suffering from auto-warming latency due to commits
>>> (we do a fair amount of committing). In theory, you could carry on
>>> sub-sharding until your hardware starts gasping for air.
> Sounds like you're doing something similar to us. In some cases we have a
> hard commit every minute. Keeping the caches hot seems like a very good
> reason to send data to a specific shard. At least I'm assuming that when you
> add documents to a single shard and commit, the other shards won't be
> impacted...

Not true if the other shards have had any indexing activity. The commit is
usually forwarded to all shards, but if the index on a particular shard is
unchanged, the commit should be a no-op there.
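A toy sketch of that no-op behavior (hypothetical; Solr's real check compares index state internally, and the class and field names here are invented for illustration): every shard receives the forwarded commit, but only a shard whose index actually changed pays the auto-warming cost.

```python
# Sketch of why a forwarded commit is cheap on an unchanged shard.
# Hypothetical model, not Solr code: Solr decides internally whether
# anything changed; here we track it with a simple pending counter.
class Shard:
    def __init__(self, name):
        self.name = name
        self.pending_docs = 0
        self.warm_count = 0   # times this shard paid the re-warm cost

    def index(self, n):
        self.pending_docs += n

    def commit(self):
        if self.pending_docs == 0:
            return "no-op"    # index unchanged: skip the expensive warm
        self.pending_docs = 0
        self.warm_count += 1  # changed shards pay auto-warm latency
        return "warmed"

shards = [Shard("shard1"), Shard("shard2")]
shards[0].index(100)          # only shard1 receives documents

# The commit is forwarded to every shard...
results = [s.commit() for s in shards]
print(results)  # ['warmed', 'no-op']
```

So sending all documents to one shard does spare the other shards' caches, as long as those shards really had no indexing activity since their last commit.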

But the usage pattern here is its own bit of a trap. If all your indexing is
going to a single shard, then the entire indexing _load_ is also happening on
that shard, so CPU utilization will be higher there than on the older shards.
Since a distributed request needs a response from every shard before
returning to the client, the response time is bounded by the slowest shard,
and so may actually get slower. That's probably only noticeable when the
CPU is maxed out anyway, though.
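The slowest-shard bound is easy to see with made-up numbers (the latencies below are hypothetical, not measurements): even if most shards answer quickly, the distributed request takes as long as the busiest one.

```python
# Hypothetical per-shard response times in ms; the shard taking all
# the indexing load (shard3) is slower because its CPU is busy.
shard_latencies = {"shard1": 40, "shard2": 45, "shard3": 180}

# A distributed request cannot return until every shard has answered,
# so overall response time is bounded by the slowest shard.
overall = max(shard_latencies.values())
print(overall)  # 180
```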

>  - Bram
