lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: How large is your solr index?
Date Wed, 07 Jan 2015 21:32:52 GMT
You shouldn't _have_ to keep track of this yourself since Solr 4.4,
see SOLR-4965 and the associated Lucene JIRA. Those are supposed to
make issuing a commit on an index that hasn't changed a no-op.
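[Editor's note: the no-op-commit behaviour Erick describes lives inside Solr itself, but the same guard is easy to apply client-side. A minimal illustrative sketch in Python — the class and method names here are hypothetical, not a Solr client API:]

```python
class IndexHandle:
    """Tracks whether any updates arrived since the last commit.

    Illustrative only: a Solr 4.4+ server already treats a commit on an
    unchanged index as a no-op (SOLR-4965); this just mirrors that guard
    on the client side.
    """

    def __init__(self):
        self.pending_updates = 0
        self.commits_issued = 0

    def add(self, doc):
        # A real client would send the document to Solr here.
        self.pending_updates += 1

    def commit(self):
        if self.pending_updates == 0:
            return False          # index unchanged: skip the commit entirely
        self.commits_issued += 1  # a real client would issue the commit here
        self.pending_updates = 0
        return True
```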

If you do issue commits and do open new searchers when the index has
NOT changed, it's worth a JIRA.

FWIW,
Erick

On Wed, Jan 7, 2015 at 1:17 PM, Peter Sturge <peter.sturge@gmail.com> wrote:
>> Is there a problem with multi-valued fields and distributed queries?
>
>> No. But there are some components that don't do the right thing in
>> distributed mode, joins for instance. The list is actually quite small and
>> is getting smaller all the time.
>
> Yes, joins are the main one. There used to be some distributed constraints
> on grouping, but those may date back to the 3.x days of field collapsing.
>
>> Sounds like you're doing something similar to us. In some cases we have a
>> hard commit every minute. Keeping the caches hot seems like a very good
>> reason to send data to a specific shard. At least I'm assuming that when you
>> add documents to a single shard and commit, the other shards won't be
>> impacted...
>
>> Not true if the other shards have had any indexing activity. The commit is
>> usually forwarded to all shards. If the individual index on a particular
>> shard is unchanged then it should be a no-op though.
>
> This is an excellent point, and well worth taking some care on.
> We do it by indexing to a number of shards, and only committing to those that
> actually have something to commit. Although an empty commit might be a
> no-op on the indexing side, it's not on the autowarming/faceting side, so
> care needs to be taken so that you don't hose your caches unnecessarily.
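[Editor's note: the "only commit to shards that actually received documents" pattern above can be sketched as a small dirty-set tracker. The shard names and the index/commit hooks below are hypothetical placeholders, not a Solr API:]

```python
class ShardCommitTracker:
    """Remember which shards received documents since the last commit and
    only commit to those, so the remaining shards' caches and autowarming
    are left alone. Illustrative sketch only."""

    def __init__(self, shards):
        self.shards = list(shards)
        self.dirty = set()

    def index(self, shard, doc):
        # A real client would POST the doc to that shard's update handler.
        self.dirty.add(shard)

    def commit_dirty(self):
        # A real client would issue a commit only against these shards.
        committed = sorted(self.dirty)
        self.dirty.clear()
        return committed
```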
>
>
> On Wed, Jan 7, 2015 at 4:42 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
>
>> See below:
>>
>>
>> On Wed, Jan 7, 2015 at 1:25 AM, Bram Van Dam <bram.vandam@intix.eu> wrote:
>> > On 01/06/2015 07:54 PM, Erick Erickson wrote:
>> >>
>> >> Have you considered pre-supposing SolrCloud and using the SPLITSHARD
>> >> API command?
>> >
>> >
>> > I think that's the direction we'll probably be going. Index size (at least
>> > for us) can be unpredictable in some cases. Some clients start out small and
>> > then grow exponentially, while others start big and then don't grow much at
>> > all. Starting with SolrCloud would at least give us that flexibility.
>> >
>> > That being said, SPLITSHARD doesn't seem ideal. If a shard reaches a certain
>> > size, it would be better for us to simply add an extra shard, without
>> > splitting.
>> >
>>
>> True, and you can do this if you take explicit control of the document
>> routing, but that's quite tricky: you forever after have to send any
>> _updates_ to the same shard you did the first time, whereas SPLITSHARD
>> will "do the right thing".
>>
>> >
>> >> On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge <peter.sturge@gmail.com>
>> >> wrote:
>> >>>
>> >>> ++1 for the automagic shard creator. We've been looking into doing this
>> >>> sort of thing internally - i.e. when a shard reaches a certain size/num
>> >>> docs, it creates 'sub-shards' to which new commits are sent and queries
>> >>> to
>> >>> the 'parent' shard are included. The concept works, as long as you don't
>> >>> try any non-dist stuff - it's one reason why all our fields are always
>> >>> single valued.
>> >
>> >
>> > Is there a problem with multi-valued fields and distributed queries?
>>
>> No. But there are some components that don't do the right thing in
>> distributed mode, joins for instance. The list is actually quite small and
>> is getting smaller all the time.
>>
>> >
>> >>> A cool side-effect of sub-sharding (for lack of a snappy term) is that
>> >>> the
>> >>> parent shard then stops suffering from auto-warming latency due to
>> >>> commits
>> >>> (we do a fair amount of committing). In theory, you could carry on
>> >>> sub-sharding until your hardware starts gasping for air.
>> >
>> >
>> > Sounds like you're doing something similar to us. In some cases we have a
>> > hard commit every minute. Keeping the caches hot seems like a very good
>> > reason to send data to a specific shard. At least I'm assuming that when you
>> > add documents to a single shard and commit, the other shards won't be
>> > impacted...
>>
>> Not true if the other shards have had any indexing activity. The commit is
>> usually forwarded to all shards. If the individual index on a particular
>> shard is unchanged then it should be a no-op though.
>>
>> But the usage pattern here is its own bit of a trap. If all your indexing is
>> going to a single shard, then the entire indexing _load_ is also happening on
>> that shard, so the CPU utilization will be higher there than on the older
>> shards. Since distributed requests need a response from every shard before
>> returning to the client, the response time is bounded by the slowest shard,
>> and this may actually be slower. Probably only noticeable when the CPU is
>> maxed anyway, though.
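[Editor's note: the "slowest shard wins" point above is simply that distributed query latency is the max, not the mean, of the per-shard latencies. A tiny illustrative sketch with made-up numbers:]

```python
def distributed_response_time(shard_latencies_ms):
    """A distributed query cannot return until every shard answers, so the
    overall latency is the maximum of the per-shard latencies."""
    return max(shard_latencies_ms)

# Evenly loaded shards: overall latency tracks the typical shard.
even = distributed_response_time([40, 42, 41])      # -> 42

# One shard absorbing all the indexing load drags the whole query down.
skewed = distributed_response_time([40, 42, 180])   # -> 180
```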
>>
>>
>>
>> >
>> >  - Bram
>> >
>>
