lucene-solr-user mailing list archives

From Bram Van Dam <bram.van...@intix.eu>
Subject Re: How large is your solr index?
Date Thu, 08 Jan 2015 11:37:19 GMT
On 01/07/2015 05:42 PM, Erick Erickson wrote:
> True, and you can do this if you take explicit control of the document
> routing, but...
> that's quite tricky: forever after you have to send any _updates_ to the
> same shard you sent the document to the first time, whereas SPLITSHARD
> will "do the right thing".

Hmm. That is a good point. I wonder if there's some kind of middle 
ground here? Something that lets me send an update (or new document) to 
an arbitrary node/shard but which is still routed according to my 
specific requirements? Maybe this can already be achieved by messing 
with the routing?
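
For what it's worth, the compositeId router might already be that middle
ground: give the document id a "routeKey!" prefix and the shard is derived
from a hash of that prefix, so an add (or a later update of the same id)
can be sent to any node and SolrCloud forwards it to the same shard every
time. A minimal sketch of what I mean, assuming a recent SolrJ; the host,
collection and field names are made up:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CompositeIdSketch {
    public static void main(String[] args) throws Exception {
        // Any node of the cluster will do: SolrCloud hashes the "2015-01!"
        // prefix and forwards the document to the shard owning that hash,
        // so updates to this id always land on the same shard.
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://solr-node-1:8983/solr/mycollection").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "2015-01!txn-000123"); // routeKey!uniqueId
            doc.addField("amount_d", 42.0);
            client.add(doc);
            client.commit();
        }
    }
}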

> <snip> there are some components that don't do the right thing in
> distributed mode, joins for instance. The list is actually quite small and
> is getting smaller all the time.


That's fine. We have a lot of query (pre-)processing outside of Solr. 
It's no problem for us to send a couple of queries to a couple of shards 
and aggregate the results ourselves. It would, of course, be nice if 
everything worked in distributed mode, but at least for us it's not an 
issue. This is a side effect of our complex reporting requirements -- we 
do aggregation, filtering and other magic on data that is partially in 
Solr and partially elsewhere.
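
Boiled down, that client-side aggregation amounts to something like the
sketch below (hypothetical shard URLs and query, recent SolrJ assumed):
each shard is queried directly with distrib=false so Solr doesn't fan out
a second time, and the counts are merged on our side.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FanOutSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical core URLs, one per shard.
        String[] shardUrls = {
            "http://box1:8983/solr/coll_shard1_replica1",
            "http://box2:8983/solr/coll_shard2_replica1"
        };
        SolrQuery q = new SolrQuery("status:FAILED");
        q.set("distrib", "false"); // hit only this core, no second fan-out
        long total = 0;
        for (String url : shardUrls) {
            try (HttpSolrClient shard = new HttpSolrClient.Builder(url).build()) {
                QueryResponse rsp = shard.query(q);
                total += rsp.getResults().getNumFound();
            }
        }
        System.out.println("aggregated hits: " + total);
    }
}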

> Not true if the other shards have had any indexing activity. The commit is
> usually forwarded to all shards. If the individual index on a
> particular shard is
> unchanged then it should be a no-op though.

I think a no-op commit no longer clears the caches either, so that's great.
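
In other words, even an explicit client-side commit like the sketch below
is usually forwarded to every shard, but a shard with nothing new to flush
should treat it as a no-op and keep its searcher and caches warm. (Names
are made up; recent SolrJ assumed.)

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class CommitSketch {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://solr-node-1:8983/solr").build()) {
            // Sent to one node, usually forwarded to every shard of the
            // collection; shards with no pending updates should no-op.
            client.commit("mycollection",
                    /* waitFlush */ true, /* waitSearcher */ true);
        }
    }
}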

> But the usage pattern here is its own bit of a trap. If all your indexing
> is going to a single shard, then the entire indexing _load_ is also
> happening on that shard, so the CPU utilization will be higher on that
> shard than on the older ones.
> Since distributed requests need to get a response from every shard before
> returning to the client, the response time will be bounded by the response from
> the slowest shard and this may actually be slower. Probably only noticeable
> when the CPU is maxed anyway though.

This is a very good point. But I don't think SPLITSHARD is the magical 
answer here. If you have N shards on N boxes, and they are all getting 
nearly "full", and you decide to split one and move half to a new box, 
you'll end up with N-1 nearly full boxes and 2 half-full boxes. What 
happens if the disks fill up further? Do I have to split each shard? 
That sounds pretty nightmarish!
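
(To be clear, the mechanics themselves are scriptable -- something like the
sketch below, with made-up names and assuming a recent SolrJ: split the hot
shard, put one sub-shard on the new box, drop the leftover copy. It's having
to repeat that dance for every shard as the disks fill up that sounds painful.)

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class SplitAndMoveSketch {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://box1:8983/solr").build()) {
            // 1. Split shard1 into shard1_0 and shard1_1 (both stay on box1).
            CollectionAdminRequest.SplitShard split =
                CollectionAdminRequest.splitShard("mycollection");
            split.setShardName("shard1");
            split.process(client);

            // 2. Put a replica of one sub-shard on the new, empty box.
            CollectionAdminRequest.AddReplica add =
                CollectionAdminRequest.addReplicaToShard("mycollection", "shard1_0");
            add.setNode("newbox:8983_solr");
            add.process(client);

            // 3. Then drop the copy left on box1 (the replica name comes
            //    from CLUSTERSTATUS; "core_node5" is made up here).
            // CollectionAdminRequest
            //     .deleteReplica("mycollection", "shard1_0", "core_node5")
            //     .process(client);
        }
    }
}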

  - Bram
