lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Pathological index condition
Date Tue, 08 Aug 2017 22:49:29 GMT
Hi Erick,

Some questions/answers below:

On Sun, Aug 6, 2017 at 8:22 PM, Erick Erickson <erickerickson@gmail.com>
wrote:

> Particularly interested if Mr. McCandless has any opinions here.
>
> I admit it took some work, but I can create an index that never merges
> and is 80% deleted documents using TieredMergePolicy.
>
> I'm trying to understand how indexes "in the wild" can have > 30%
> deleted documents. I think the root issue here is that
> TieredMergePolicy doesn't consider for merging any segments > 50% of
> maxMergedSegmentMB of non-deleted documents.
>
> Let's say I have segments at the default 5G max. For the sake of
> argument, it takes exactly 5,000,000 identically-sized documents to
> fill the segment to exactly 5G.
>
> IIUC, as long as the segment has more than 2,500,000 documents in it
> it'll never be eligible for merging.


That's right.
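
As a rough illustration of the arithmetic in the quoted example, here is a minimal sketch in plain Java. It is not Lucene code: the class and method names are hypothetical, "5G" is treated as 5,000,000,000 bytes to keep the numbers round, and the rule is modeled the way the quoted text states it (a segment is skipped once its live documents exceed half of the max merged segment size).

```java
// Hypothetical sketch of the eligibility rule described above; not TieredMergePolicy itself.
final class MergeEligibilitySketch {

    /** Size of the non-deleted documents, assuming identically sized documents. */
    static long liveBytes(long liveDocs, long bytesPerDoc) {
        return liveDocs * bytesPerDoc;
    }

    /** Assumed rule: eligible only while live data is at most half the max segment size. */
    static boolean eligibleForMerge(long liveBytes, long maxMergedSegmentBytes) {
        return liveBytes <= maxMergedSegmentBytes / 2;
    }

    public static void main(String[] args) {
        long maxSegmentBytes = 5_000_000_000L; // "5G", rounded for illustration
        long bytesPerDoc = 1_000L;             // so 5,000,000 docs fill one segment

        // Just over half the documents are still live: the segment is skipped.
        System.out.println(eligibleForMerge(liveBytes(2_500_001L, bytesPerDoc), maxSegmentBytes));
        // Enough deletes have accumulated: the segment can be considered again.
        System.out.println(eligibleForMerge(liveBytes(2_400_000L, bytesPerDoc), maxSegmentBytes));
    }
}
```

Under these assumptions a full-size segment has to lose half of its documents before a merge can reclaim any of the space.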


> The only way to force deleted
> docs to be purged is to expungeDeletes or optimize, neither of which
> is recommended.


+1

> The condition I created was highly artificial but illustrative:
> - I set my max segment size to 20M
> - Through experimentation I found that each segment would hold roughly
> 160K synthetic docs.
> - I set my ramBuffer to 1G.
> - Then I'd index 500K docs, then delete 400K of them, and commit. This
> produces a single segment occupying (roughly) 80M of disk space, 15M
> or so of it "live" documents, the rest deleted.
> - rinse, repeat with a disjoint set of doc IDs.
>
> The number of segments continues to grow forever, each one consisting
> of 80% deleted documents.
>

But wouldn't TMP at some point merge these segments?  Or are you saying
that each segment's 20% of not-deleted docs is still greater than 1/2 of the
max segment size, and so TMP considers them ineligible?

This is indeed a rather pathological case, and you're right TMP would never
merge them (if my logic above is right).  Maybe we could tweak TMP for
situations like this, though I'm not sure they happen in practice.
Normally the max segment size is quite a bit larger than the initially
flushed segment sizes.
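
The scenario can be sketched as a small simulation in plain Java. This is not Lucene code: the names are hypothetical, sizes are in round decimal megabytes, and live bytes are assumed to be proportional to live documents (which gives 16M of live data per segment rather than the "15M or so" observed). Merging itself is not modeled; the sketch only counts segments that the assumed 50% rule would leave alone.

```java
// Hypothetical simulation of the index/delete/commit loop described above.
final class PathologicalIndexSketch {
    static final long MB = 1_000_000L;

    /** Assumed rule: a segment is skipped when its live bytes exceed half the max segment size. */
    static boolean tooBigToMerge(long liveBytes, long maxMergedSegmentBytes) {
        return liveBytes > maxMergedSegmentBytes / 2;
    }

    /** Returns {segmentCount, totalDocs, deletedDocs} after the given number of rounds. */
    static long[] simulate(int rounds, long docsPerRound, long deletesPerRound,
                           long flushedSegmentBytes, long maxMergedSegmentBytes) {
        // Live size per flushed segment, assuming identically sized documents.
        long liveBytes = flushedSegmentBytes * (docsPerRound - deletesPerRound) / docsPerRound;
        if (!tooBigToMerge(liveBytes, maxMergedSegmentBytes)) {
            throw new IllegalArgumentException("segments would be mergeable; not modeled here");
        }
        long segments = 0, totalDocs = 0, deletedDocs = 0;
        for (int i = 0; i < rounds; i++) {
            segments++;                     // each commit leaves one segment that is never merged
            totalDocs += docsPerRound;
            deletedDocs += deletesPerRound;
        }
        return new long[] {segments, totalDocs, deletedDocs};
    }

    public static void main(String[] args) {
        long[] r = simulate(10, 500_000L, 400_000L, 80 * MB, 20 * MB);
        System.out.println("segments=" + r[0] + " deleted%=" + (100 * r[2] / r[1]));
    }
}
```

With a max segment size of 20M, each flushed segment keeps more than 10M of live data, so under these assumptions the segment count grows by one per round and the index stays at 80% deleted documents.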


> This artificial situation just allowed me to see how the segments
> merged. Without such artificial constraints I suspect the limit for
> deleted documents would be capped at 50% theoretically and in practice
> less than that although I have seen 35% or so deleted documents in the
> wild.
>

Yeah I think so too.  I wrote this blog post about deletions:
https://www.elastic.co/blog/lucenes-handling-of-deleted-documents

It has a fun chart showing how the percentage of deleted docs bounces around.


> So at the end of the day I have a couple of questions:
>
> 1> Is my understanding close to correct? This is really the first time
> I've had to dive into the guts of merging.
>

Yes!


> 2> Is there a way I've missed to slim down an index other than
> expungeDeletes or optimize/forceMerge?
>

No.

> It seems to me like eventually, with large indexes, every segment that
> is the max size allowed is going to have to go over 50% deletes before
> being merged and there will have to be at least two of them. I don't
> see a clean way to fix this, any algorithm would likely be far too
> expensive to be part of regular merging. I suppose we could merge
> segments of different sizes if the combined size was < max segment
> size. On a quick glance it doesn't seem like the log merge policies
> address this kind of case either, but haven't dug into them much.
>

TMP should be able to merge one max sized segment (that has eked just over
50% deleted docs) with smaller sized segments.  It would not prefer this
merge, since merging substantially different segment sizes performs worse
than merging equally sized segments, but it does have a bias for
removing deleted docs that would offset that.
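
The trade-off between size skew and reclaimed deletes can be illustrated with a toy scoring function in plain Java. This is not TieredMergePolicy's actual scoring code: the class name, the shape of the skew term, and the weight of 2.0 are assumptions chosen only to show how a bias toward reclaiming deletes can outweigh a penalty for lopsided merges. Lower scores are better.

```java
// Hypothetical toy score for a candidate merge; lower is better.
final class MergeScoreSketch {

    static double score(long[] segmentBytes, double[] deletedFraction, double reclaimDeletesWeight) {
        double before = 0, after = 0, largestLive = 0;
        for (int i = 0; i < segmentBytes.length; i++) {
            double live = segmentBytes[i] * (1.0 - deletedFraction[i]);
            before += segmentBytes[i];
            after += live;
            largestLive = Math.max(largestLive, live);
        }
        if (after == 0) {
            return 0; // everything is deleted: the merge is pure reclamation
        }
        double skew = largestLive / after;   // 1/n for equal sizes, near 1 when lopsided
        double nonDelRatio = after / before; // shrinks as more deleted data is reclaimed
        return skew * Math.pow(nonDelRatio, reclaimDeletesWeight);
    }

    public static void main(String[] args) {
        long[] lopsided = {5_000_000_000L, 500_000_000L, 500_000_000L};
        long[] balanced = {500_000_000L, 500_000_000L, 500_000_000L};
        // One max sized segment at 51% deletes merged with two small clean segments.
        System.out.println(score(lopsided, new double[] {0.51, 0.0, 0.0}, 2.0));
        // The same sizes with nothing to reclaim.
        System.out.println(score(lopsided, new double[] {0.0, 0.0, 0.0}, 2.0));
        // Three equally sized clean segments.
        System.out.println(score(balanced, new double[] {0.0, 0.0, 0.0}, 2.0));
    }
}
```

In this toy model the lopsided merge with 51% deletes scores better than both the same merge without deletes and the balanced merge of clean segments, which is the kind of offset described above.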


> Thanks!
>

You're welcome!

Mike McCandless

http://blog.mikemccandless.com
