If you're going to do frequent commits or refreshes, it means you care less about raw indexing performance and more about durability and low-latency searching.

In that case, it's best to index with only 1 thread: then every commit/refresh will only write one segment and you minimize the merge pressure.

Adrien is right that those recently created segments will be fast to merge since they will typically be hot in the OS cache.

And yes this is an evil adversary to the merge policy and scheduler!!


On Tue, Aug 1, 2017 at 1:39 PM, Erick Erickson <erickerickson@gmail.com> wrote:
IIUC, segments are actually written the first time when the
ramBufferSizeMB is exceeded. If you can afford it you might increase
that number. NOTE: I'm going from memory here so you should check....
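
For reference, Erick's suggestion above is a one-liner on IndexWriterConfig (a sketch assuming Lucene 6.x-era APIs; the 512 MB value is only an illustration, not a recommendation):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;

public class FlushTuning {
  static IndexWriterConfig largeRamBufferConfig() {
    IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
    // Default is 16 MB; a larger buffer means fewer, larger flushed
    // segments, and hence less merge pressure. 512 is illustrative only.
    cfg.setRAMBufferSizeMB(512.0);
    return cfg;
  }
}
```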

That doesn't really address merging segments with deleted docs though.
I do wonder what happens if you bump the segments per tier. My guess:
less frequent but more intense merges, so the overall effect is unclear.
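
Bumping segments per tier as Erick describes would look roughly like this (again a sketch against Lucene 6.x-era TieredMergePolicy; 20 is an arbitrary illustrative value):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class MergeTuning {
  static IndexWriterConfig lazierMergeConfig() {
    TieredMergePolicy tmp = new TieredMergePolicy();
    // Default is 10.0: a higher value tolerates more segments per tier
    // before a merge is triggered (fewer, but bigger, merges).
    tmp.setSegmentsPerTier(20.0);
    // Often raised together with segmentsPerTier.
    tmp.setMaxMergeAtOnce(20);
    IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
    cfg.setMergePolicy(tmp);
    return cfg;
  }
}
```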

Best,
Erick

On Tue, Aug 1, 2017 at 8:00 AM, Walter Underwood <wunder@wunderwood.org> wrote:
> Optimizing for frequent changes sounds like a caching strategy, maybe “LRU
> merging”. Perhaps prefer merging segments that have not changed in a while?
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> On Aug 1, 2017, at 5:50 AM, Tommaso Teofili <tommaso.teofili@gmail.com>
> wrote:
>
>
>
> Il giorno mar 1 ago 2017 alle ore 14:04 Adrien Grand <jpountz@gmail.com> ha
> scritto:
>>
>> The trade-off does not sound simple to me. This approach could lead to
>> having more segments overall, making search requests and updates potentially
>> slower and more I/O-intensive since they have to iterate over more segments.
>> I'm not saying this is a bad idea, but it could have unexpected
>> side-effects.
>
>
> yes, that's my concern as well.
>
>>
>>
>> Do you actually have a high commit rate or a high reopen rate
>> (DirectoryReader.open(IndexWriter))?
>
>
> in my scenario both, but the commit rate far exceeds the reopen rate.
>
>>
>> Maybe reopening instead of committing (and still committing, but less
>> frequently) would decrease the I/O load since NRT segments might never need
>> to be actually written to disk if they are merged before the next commit
>> happens and you give enough memory to the filesystem cache.
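
Adrien's reopen-instead-of-commit pattern is what SearcherManager wraps up; a wiring sketch (assuming Lucene 6.x APIs, with `writer` an already-open IndexWriter):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;

public class NrtWiring {
  static void searchNrt(IndexWriter writer) throws IOException {
    SearcherManager mgr = new SearcherManager(writer, new SearcherFactory());
    // Cheap NRT reopen: makes recent changes searchable without a commit.
    // Typically called periodically (e.g. every 100 ms) from a refresh thread.
    mgr.maybeRefresh();
    IndexSearcher searcher = mgr.acquire();
    try {
      // run searches against a point-in-time NRT view
    } finally {
      mgr.release(searcher);
    }
    // Commit far less often, purely for durability.
    writer.commit();
  }
}
```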
>
>
> makes sense in general, however I am a bit constrained in how much I can
> avoid committing (states in an MVCC system are tied to commits, so it's
> trickier).
>
> In general I was wondering if we could have the merge policy look at both
> the commit rate and the number of segments, and decide whether to merge based
> on both, so that when segment growth stays within a threshold we could save
> some merges under high commit rates, though, as you say, we may then have to
> do bigger merges later.
> I imagine this makes more sense when a lot of tiny changes are made to
> the index rather than a few big ones (then the bigger-merges problem should
> be less significant).
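
To make the idea concrete, here is a minimal, hypothetical sketch of the gating logic such a policy could consult: track recent commit timestamps and defer merges while the rate is above a threshold. This is plain Java, not an actual Lucene MergePolicy, and every name here is made up for illustration.

```java
import java.util.ArrayDeque;

// Hypothetical helper (not a Lucene class): a custom merge policy could
// call shouldMerge() before returning merge candidates, deferring merges
// during bursts of commits.
public class CommitRateGate {
  private final ArrayDeque<Long> commitTimesMs = new ArrayDeque<>();
  private final long windowMs;
  private final int maxCommitsInWindow;

  public CommitRateGate(long windowMs, int maxCommitsInWindow) {
    this.windowMs = windowMs;
    this.maxCommitsInWindow = maxCommitsInWindow;
  }

  /** Record a commit at the given timestamp (milliseconds). */
  public synchronized void onCommit(long nowMs) {
    commitTimesMs.addLast(nowMs);
    evictOld(nowMs);
  }

  /** True if the recent commit rate is low enough that merging is worthwhile. */
  public synchronized boolean shouldMerge(long nowMs) {
    evictOld(nowMs);
    return commitTimesMs.size() < maxCommitsInWindow;
  }

  /** Drop commit timestamps that fell out of the sliding window. */
  private void evictOld(long nowMs) {
    while (!commitTimesMs.isEmpty()
        && nowMs - commitTimesMs.peekFirst() > windowMs) {
      commitTimesMs.removeFirst();
    }
  }
}
```

The timestamps are injected rather than read from the clock so the behavior is deterministic and easy to test; a real implementation would also need to ensure deferred merges eventually run (e.g. a maximum deferral time), or segment counts could grow without bound.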
>
> Beyond my specific scenario, I think we could take another look at the
> current merge policy algorithm and see if we can improve it, or make it more
> robust to the way the "sneaky opponent" (Mike's ™ [1]) behaves.
>
> My 2 cents,
> Tommaso
>
> [1] :
> http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
>
>>
>>
>> Le mar. 1 août 2017 à 10:59, Tommaso Teofili <tommaso.teofili@gmail.com> a
>> écrit :
>>>
>>> Hi all,
>>>
>>> lately I've been looking a bit more closely at merge policies, particularly
>>> the tiered one, and I was wondering if we can reduce the number of
>>> avoidable merges in high-commit-rate scenarios, especially when a high
>>> percentage of the commits touches the same docs.
>>> I've watched how merges evolve in such scenarios, and it seemed to me the
>>> merge policy was too aggressive, causing a large I/O overhead.
>>> I then tried a merge policy that tentatively looked at the commit rate and
>>> skipped merges when that rate exceeded a threshold; it seemed to give
>>> slightly better results in reducing the unneeded I/O caused by avoidable
>>> merges.
>>>
>>> I know this is a bit abstract but I would like to know if anyone has any
>>> ideas or plans about mitigating the merge overhead in general and / or in
>>> similar cases.
>>>
>>> Regards,
>>> Tommaso
>>>
>>>
>>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org