Today, LogMP allows you to set different thresholds for segment sizes,
thereby letting you control the largest segment that will be
considered for a merge, and hence the largest segment your index will hold (=~
threshold * mergeFactor).
So, if you want to end up w/, say, 20GB segments, you can set
maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.
However, this often does not achieve the desired goal -- if the index
contains 5 GB and 7 GB segments, they will never be merged b/c they are
bigger than the threshold. I am willing to spend the CPU and IO resources
to end up w/ 20 GB segments, whether I'm merging 10 segments together or
only 2. After I reach a 20 GB segment, it can rest peacefully, at least
until I increase the threshold.
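To make the arithmetic concrete, here's a small sketch (Python, purely for illustration -- the names maxMergeMB and mergeFactor mirror LogMP's settings, but this is a simplified model of the size check, not the real policy) showing how segments above the threshold get stranded:

```python
# Simplified model of LogMP's size threshold (illustration only; the real
# policy also deals with levels, deletes, etc.).

def eligible_for_merge(segment_sizes_mb, max_merge_mb):
    """A segment larger than the threshold is never considered for merge."""
    return [s for s in segment_sizes_mb if s <= max_merge_mb]

max_merge_mb = 2 * 1024   # 2 GB threshold
merge_factor = 10

# Largest segment the index will (roughly) hold: threshold * mergeFactor
print(max_merge_mb * merge_factor / 1024, "GB")  # -> 20.0 GB

# But the 5 GB and 7 GB segments are above the 2 GB threshold, so only the
# 1 GB segment is ever a merge candidate:
print(eligible_for_merge([5 * 1024, 7 * 1024, 1024], max_merge_mb))  # -> [1024]
```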
So I wonder, first, whether this threshold (i.e., the largest segment size you
would like to end up with) is more natural to set than the current thresholds,
from the application level? I.e., wouldn't it be a simpler threshold to set,
instead of doing weird calculations that depend on maxMergeMB(ForOptimize)
and mergeFactor?
Second, should this be an addition to LogMP, or a different
type of MP -- one that adheres to only those two factors (perhaps the
segSize threshold should be settable differently for optimize and
regular merges)? It could pick segments for merge such that it maximizes
the resulting segment size (i.e., not necessarily merging in sequential
order), while merging no more than mergeFactor segments at once.
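A sketch of what such a policy's selection step might look like (hypothetical: maxResultSegmentSizeMB and the greedy choice below are my illustration, not an existing Lucene API; and note that greedy packing only approximates "maximize the result size" -- doing it exactly is a knapsack problem):

```python
def pick_merge(segment_sizes_mb, max_result_mb, merge_factor):
    """Greedily pick up to merge_factor segments (not necessarily adjacent)
    whose combined size stays under max_result_mb, preferring the largest
    segments first so the merged segment ends up as big as possible.
    A greedy approximation -- optimal packing would be a knapsack problem."""
    picked, total = [], 0
    for size in sorted(segment_sizes_mb, reverse=True):
        if len(picked) == merge_factor:
            break
        if total + size <= max_result_mb:
            picked.append(size)
            total += size
    return picked if len(picked) >= 2 else []  # a merge needs >= 2 segments

# The 5 GB and 7 GB segments from before are now merge candidates, packed
# toward (but not over) a 20 GB result:
print(pick_merge([5, 7, 3, 12], max_result_mb=20, merge_factor=10))  # -> [12, 7]
```

The point of the non-sequential selection is visible here: a sequential policy would only consider adjacent segments, whereas this picks whichever combination gets closest to the cap.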
I guess, if we think that maxResultSegmentSizeMB is more intuitive than
the current thresholds, application-wise, then this change should go
into LogMP. Otherwise, it feels like a different MP is needed, because
LogMP is already complicated and another threshold would confuse things.
What do you think of this? Am I trying to optimize too much? :)