lucene-dev mailing list archives

From "Erick Erickson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8263) Add indexPctDeletedTarget as a parameter to TieredMergePolicy to control more aggressive merging
Date Sat, 07 Jul 2018 15:24:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535798#comment-16535798 ]

Erick Erickson commented on LUCENE-8263:
----------------------------------------

[~jpountz] All help welcome of course! The code changes for this one are actually pretty minimal,
at least given my first go at it a while ago. The effects, well, we'll see.

Do be careful of one thing, though: when you forceMerge and specify N segments where N >
1, TMP no longer guarantees that there'll be exactly N segments when done. It does a "best
effort" single-pass attempt to avoid rewriting segments uselessly.

FWIW, I ran 2,000 beasts of LUCENE-8383 and LUCENE-8385 combined last night and they all passed.
I'll be happy to help beast any interim additional tests you come up with.

> Add indexPctDeletedTarget as a parameter to TieredMergePolicy to control more aggressive merging
> ------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8263
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8263
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Major
>
> Spinoff of LUCENE-7976 to keep the two issues separate.
> The current TMP allows up to 50% deleted docs, which can be wasteful on large indexes.
> This parameter will do more aggressive merging of segments with deleted documents when the
> _total_ percentage of deleted docs in the entire index exceeds it.
> Setting this to 50% should approximate current behavior. Setting it to 20% caused the
> first cut at this to increase I/O roughly 10%. Setting it to 10% caused about a 50% increase
> in I/O.
> I was conflating the two issues, so I'll change 7976 and comment out the bits that reference
> this new parameter. After it's checked in we can bring this back. That should be less work
> than reconstructing this later.
> Among the questions to be answered:
> 1> what should the default be? I propose 20% as it results in significantly less space
> wasted and helps control heap usage for a modest increase in I/O.
> 2> what should the floor be? I propose 10% with _strong_ documentation warnings about
> not setting it below 20%.
> 3> should there be two parameters? I think this was discussed somewhat in 7976. The
> first cut at this used this number for two purposes:
> 3a> the total percentage of deleted docs index-wide to trip this trigger
> 3b> the percentage of an _individual_ segment that had to be deleted if the segment
> was over maxSegmentSize/2 bytes in order to be eligible for merging. Empirically, using the
> same percentage for both caused the merging to hover around the value specified for this parameter.
> My proposal for <3> would be to have the parameter do double-duty. Assuming my
> preliminary results hold, you specify this parameter at, say, 20% and once the index hits
> that % deleted docs it hovers right around there, even if you've forceMerged earlier down
> to 1 segment. This seems in line with what I'd expect and adding another parameter seems excessively
> complicated to no good purpose. We could always add something like that later if we wanted.
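
For anyone following along, the "double-duty" idea in <3> of the quoted description can be roughly sketched as below. This is only an illustration in plain Java with no Lucene dependency, not the patch itself: the class DeletedPctSketch, its Segment type, and the method names are all hypothetical, and the strict ">" comparison for 3b is an assumption (the description only says "exceeds" for the index-wide trigger in 3a).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the double-duty percentage; NOT TieredMergePolicy's code.
class DeletedPctSketch {

    /** Minimal stand-in for a segment: document counts and size on disk. */
    static final class Segment {
        final String name;
        final int maxDoc;      // live + deleted docs
        final int delCount;    // deleted docs
        final long sizeBytes;

        Segment(String name, int maxDoc, int delCount, long sizeBytes) {
            this.name = name;
            this.maxDoc = maxDoc;
            this.delCount = delCount;
            this.sizeBytes = sizeBytes;
        }

        double pctDeleted() {
            return maxDoc == 0 ? 0.0 : 100.0 * delCount / maxDoc;
        }
    }

    /** 3a: percentage of deleted docs across the whole index. */
    static double indexPctDeleted(List<Segment> segments) {
        long total = 0;
        long deleted = 0;
        for (Segment s : segments) {
            total += s.maxDoc;
            deleted += s.delCount;
        }
        return total == 0 ? 0.0 : 100.0 * deleted / total;
    }

    /**
     * Returns the large segments (over maxSegmentSizeBytes / 2) that become
     * eligible for merging. The same pctDeletedTarget is used for the
     * index-wide trigger (3a) and the per-segment threshold (3b).
     */
    static List<Segment> eligibleLargeSegments(
            List<Segment> segments, double pctDeletedTarget, long maxSegmentSizeBytes) {
        List<Segment> eligible = new ArrayList<>();
        if (indexPctDeleted(segments) <= pctDeletedTarget) {
            return eligible; // 3a not tripped: nothing extra to merge
        }
        for (Segment s : segments) {
            boolean large = s.sizeBytes > maxSegmentSizeBytes / 2;
            // Assumption: strict ">" for the per-segment check (3b).
            if (large && s.pctDeleted() > pctDeletedTarget) {
                eligible.add(s);
            }
        }
        return eligible;
    }
}
```

With a 20% target and a 5 GB max segment size, an index that is 30% deleted overall trips the trigger, and only segments larger than 2.5 GB that are themselves more than 20% deleted are picked up. Smaller segments are left out of this sketch on the assumption that the normal merge selection handles them.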



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

