lucene-dev mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Pathological index condition
Date Mon, 28 Aug 2017 20:43:30 GMT
bq: I guess the alternative would be to occasionally roll the dice and
decide to merge that kind of segment.

That's what I was getting at with the "autoCompact" idea, in a more
deterministic manner.



On Mon, Aug 28, 2017 at 1:32 PM, Walter Underwood <wunder@wunderwood.org> wrote:
> That makes sense.
>
> I guess the alternative would be to occasionally roll the dice and decide to
> merge that kind of segment.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> On Aug 28, 2017, at 1:28 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>
> I don't think jitter would help. As long as a segment's "live" docs
> exceed 50% of the max segment size, it's forever ineligible for merging
> (outside of optimize or expungeDeletes commands). So the "zone" is
> anything over 50%.
>
> Or I missed your point.
>
> Erick
>
> On Mon, Aug 28, 2017 at 12:50 PM, Walter Underwood
> <wunder@wunderwood.org> wrote:
>
> If this happens in a precise zone, how about adding some random jitter to
> the threshold? That tends to get this kind of lock-up unstuck.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> On Aug 28, 2017, at 12:44 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
>
> And one more thought (not very well thought out).
>
> A parameter on TMP (or whatever) that did <3> something like:
>
> a parameter <autoCompactTime>
> a parameter <autoCompactPct>
> On startup TMP takes the current timestamp
>
> *> Every minute (or whatever) it checks the current timestamp and if
> <autoCompactTime> is in between the last check time and now, do <2>.
>
> set the last checked time to the value from * above.
>
>
> Taking the current timestamp would keep from kicking off the compaction
> on startup, so we wouldn't need to keep some stateful information
> across restarts and wouldn't go into a compact cycle on startup.
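
A minimal sketch of the timing check described above, in Python for brevity. `AutoCompactClock` and `should_compact` are hypothetical names for illustration, not part of TieredMergePolicy or any Lucene API; the only assumption beyond the text is that <autoCompactTime> is a daily wall-clock time.

```python
from datetime import datetime, time, timedelta


class AutoCompactClock:
    """Hypothetical helper: reports when the daily compaction time is crossed."""

    def __init__(self, auto_compact_time: time, now: datetime):
        self.auto_compact_time = auto_compact_time
        # Start from "now" so that a restart alone never triggers compaction
        # and no state has to survive across restarts.
        self.last_check = now

    def should_compact(self, now: datetime) -> bool:
        """True if auto_compact_time falls in the interval (last_check, now]."""
        due = False
        day = self.last_check.date()
        # Look at every calendar day the interval touches, so a check that
        # spans midnight is handled as well.
        while day <= now.date():
            candidate = datetime.combine(day, self.auto_compact_time)
            if self.last_check < candidate <= now:
                due = True
                break
            day += timedelta(days=1)
        # Remember this check time for the next periodic call.
        self.last_check = now
        return due


# Usage: the policy would call should_compact() once a minute (or whatever).
clock = AutoCompactClock(time(1, 0), datetime(2017, 8, 28, 0, 58, 30))
first = clock.should_compact(datetime(2017, 8, 28, 0, 59, 30))
second = clock.should_compact(datetime(2017, 8, 28, 1, 0, 30))
third = clock.should_compact(datetime(2017, 8, 28, 1, 1, 30))
```

Only the check whose interval contains 1:00 AM reports that compaction is due; a process started just after 1:00 AM waits until the next day.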
>
> Erick
>
> On Sun, Aug 27, 2017 at 11:31 AM, Erick Erickson
> <erickerickson@gmail.com> wrote:
>
> I've been thinking about this a little more. Since this is an outlier,
> I'm loath to change the core TMP merge selection process. Say the max
> segment size is 5G. You'd be doing an awful lot of I/O to merge a
> segment with 4.75G "live" docs with one with 0.25G. Plus that doesn't
> really allow users who issue the tempting "optimize" command to
> recover; that one huge segment can hang around for a _very_ long time,
> accumulating lots of deleted docs. Same with expungeDeletes.
>
> I can think of several approaches:
>
> 1> despite my comment above, a flag that says something like "if a
> segment has > X% deleted docs, merge it with a smaller segment anyway
> respecting the max segment size. I know, I know this will affect
> indexing throughput, do it anyway".
>
> 2> A special op (or perhaps a flag on expungeDeletes) that would
> behave like <1> but on-demand rather than part of standard merging.
>
> In both of these cases, if a segment had > X% deleted docs but the
> live doc size for that segment was > the max seg size, rewrite it into
> a single new segment removing deleted docs.
>
> 3> some way to do the above on a schedule. My notion is something like
> a maintenance window at 1:00 AM. You'd still have a live collection,
> but (presumably) a way to purge the day's accumulation of deleted
> documents during off hours.
>
> 4> ???
>
> I probably like <2> best so far. I don't see this condition in the
> wild very often; it usually occurs during heavy re-indexing operations
> and often after an optimize or expungeDeletes has happened. <1> could
> get horribly pathological if the threshold was 1% or something.
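
One way to picture the selection rule shared by <1> and <2>, as a rough Python sketch. `Segment` and `plan_compaction` are hypothetical names, sizes are in MB, and the greedy packing is an assumption made for illustration; it is not how TieredMergePolicy chooses merges.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    live_mb: float
    deleted_mb: float

    @property
    def deleted_pct(self) -> float:
        total = self.live_mb + self.deleted_mb
        return 100.0 * self.deleted_mb / total if total else 0.0


def plan_compaction(segments, max_segment_mb, deleted_pct_threshold):
    """Return groups of segment names to rewrite together.

    Each segment above the deleted-docs threshold is grouped with smaller
    segments as long as the combined live size respects max_segment_mb.
    A group of one means "rewrite this segment alone, dropping deletes".
    """
    plans = []
    used = set()
    candidates = sorted(
        (s for s in segments if s.deleted_pct > deleted_pct_threshold),
        key=lambda s: s.deleted_pct,
        reverse=True,
    )
    for seg in candidates:
        if seg.name in used:
            continue
        used.add(seg.name)
        group = [seg]
        budget = max_segment_mb - seg.live_mb
        # Greedily add the smallest unused segments that still fit.
        for other in sorted(segments, key=lambda s: s.live_mb):
            if other.name in used or other.live_mb >= seg.live_mb:
                continue
            if other.live_mb <= budget:
                group.append(other)
                used.add(other.name)
                budget -= other.live_mb
        plans.append([s.name for s in group])
    return plans


segments = [
    Segment("A", live_mb=2000, deleted_mb=3000),  # 60% deleted
    Segment("B", live_mb=250, deleted_mb=0),
    Segment("C", live_mb=4000, deleted_mb=3000),  # about 43% deleted
]
plans = plan_compaction(segments, max_segment_mb=5000, deleted_pct_threshold=40)
```

With these made-up numbers, A is merged with the small segment B, while C has no room left under the 5G cap and is rewritten on its own.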
>
> WDYT?
>
>
> On Wed, Aug 9, 2017 at 2:40 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
>
> Thanks Mike:
>
> bq: Or are you saying that each segment's 20% of not-deleted docs is
> still greater than 1/2 of the max segment size, and so TMP considers
> them ineligible?
>
> Exactly.
>
> Hadn't seen the blog, thanks for that. Added to my list of things to refer
> to.
>
> The problem we're seeing is that "in the wild" there are cases where
> people can now get satisfactory performance from huge numbers of
> documents, as in close to 2B (there was a question on the users' list
> about that recently). So allowing up to 60% deleted documents is
> dangerous in that situation.
>
> And the situation is exacerbated by optimizing (I know, "don't do that").
>
> Ah, well, the joys of people using this open source thing and pushing
> its limits.
>
> Thanks again,
> Erick
>
> On Tue, Aug 8, 2017 at 3:49 PM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
>
> Hi Erick,
>
> Some questions/answers below:
>
> On Sun, Aug 6, 2017 at 8:22 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
>
>
> Particularly interested if Mr. McCandless has any opinions here.
>
> I admit it took some work, but I can create an index that never merges
> and is 80% deleted documents using TieredMergePolicy.
>
> I'm trying to understand how indexes "in the wild" can have > 30%
> deleted documents. I think the root issue here is that
> TieredMergePolicy doesn't consider for merging any segments > 50% of
> maxMergedSegmentMB of non-deleted documents.
>
> Let's say I have segments at the default 5G max. For the sake of
> argument, it takes exactly 5,000,000 identically-sized documents to
> fill the segment to exactly 5G.
>
> IIUC, as long as the segment has more than 2,500,000 documents in it
> it'll never be eligible for merging.
>
>
>
> That's right.
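
The arithmetic in this exchange can be checked with a small Python sketch. The function names are hypothetical; the only rule encoded is the one stated above, that a segment whose live docs exceed half of the max segment size is not considered for merging, with size measured in identically-sized documents as in the example.

```python
def is_merge_eligible(live_docs: int, docs_per_max_segment: int) -> bool:
    """A segment is considered for merging only once its live docs
    are no more than half of what a max-sized segment holds."""
    return live_docs <= docs_per_max_segment / 2


def min_deleted_pct_before_eligible(total_docs: int, docs_per_max_segment: int) -> float:
    """Percentage of a segment's docs that must be deleted before it
    becomes eligible under the rule above."""
    half = docs_per_max_segment / 2
    if total_docs <= half:
        return 0.0
    return 100.0 * (total_docs - half) / total_docs


DOCS_PER_5G = 5_000_000  # from the example: 5,000,000 docs fill a 5G segment

just_over = is_merge_eligible(2_500_001, DOCS_PER_5G)
at_half = is_merge_eligible(2_500_000, DOCS_PER_5G)
needed_pct = min_deleted_pct_before_eligible(5_000_000, DOCS_PER_5G)
```

For a full 5G segment this gives the 50% figure discussed in the thread: half of its documents must be deleted before ordinary merging will touch it.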
>
>
> The only way to force deleted
> docs to be purged is to expungeDeletes or optimize, neither of which
> is recommended.
>
>
>
> +1
>
> The condition I created was highly artificial but illustrative:
> - I set my max segment size to 20M
> - Through experimentation I found that each segment would hold roughly
> 160K synthetic docs.
> - I set my ramBuffer to 1G.
> - Then I'd index 500K docs, then delete 400K of them, and commit. This
> produces a single segment occupying (roughly) 80M of disk space, 15M
> or so of it "live" documents, the rest deleted.
> - rinse, repeat with a disjoint set of doc IDs.
>
> The number of segments continues to grow forever, each one consisting
> of 80% deleted documents.
>
>
>
> But wouldn't TMP at some point merge these segments?  Or are you saying that
> each segment's 20% of not-deleted docs is still greater than 1/2 of the max
> segment size, and so TMP considers them ineligible?
>
> This is indeed a rather pathological case, and you're right TMP would never
> merge them (if my logic above is right).  Maybe we could tweak TMP for
> situations like this, though I'm not sure they happen in practice.  Normally
> the max segment size is quite a bit larger than the initially flushed
> segment sizes.
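
A toy Python simulation of the experiment above, to show why the segment count keeps growing. `simulate` is a hypothetical helper; it assumes live size scales with the fraction of surviving docs and models only the "live size over half the max" exclusion, not TMP's real merge selection.

```python
def simulate(cycles, max_segment_mb=20.0, segment_mb=80.0,
             docs_indexed=500_000, docs_deleted=400_000):
    """Return the live sizes (MB) of the segments left after `cycles` rounds
    of index, delete, commit."""
    live_fraction = (docs_indexed - docs_deleted) / docs_indexed
    half_max = max_segment_mb / 2
    segments = []
    for _ in range(cycles):
        # Each commit flushes one segment; only its live part counts.
        segments.append(segment_mb * live_fraction)
        eligible = [s for s in segments if s <= half_max]
        # Toy merge: combine eligible segments as soon as there are two.
        if len(eligible) >= 2:
            too_big = [s for s in segments if s > half_max]
            segments = too_big + [sum(eligible)]
    return segments


# The thread's numbers: about 16M live per segment against a 10M cutoff.
stuck = simulate(10)
# Contrast: deleting 450K instead leaves about 8M live, under the cutoff.
merging = simulate(4, docs_deleted=450_000)
```

With the thread's numbers every flushed segment stays above the 10M cutoff, so ten cycles leave ten segments, each 80% deleted. In the contrast case the segments fall under the cutoff and get paired up.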
>
>
> This artificial situation just allowed me to see how the segments
> merged. Without such artificial constraints I suspect the limit for
> deleted documents would be capped at 50% theoretically, and in practice
> less than that, although I have seen 35% or so deleted documents in the
> wild.
>
>
>
> Yeah I think so too.  I wrote this blog post about deletions:
> https://www.elastic.co/blog/lucenes-handling-of-deleted-documents
>
> It has a fun chart showing how the percentage of deleted docs bounces around.
>
>
> So at the end of the day I have a couple of questions:
>
> 1> Is my understanding close to correct? This is really the first time
> I've had to dive into the guts of merging.
>
>
>
> Yes!
>
>
> 2> Is there a way I've missed to slim down an index other than
> expungeDeletes or optimize/forceMerge?
>
>
>
> No.
>
> It seems to me like eventually, with large indexes, every segment that
> is the max size allowed is going to have to go over 50% deletes before
> being merged and there will have to be at least two of them. I don't
> see a clean way to fix this, any algorithm would likely be far too
> expensive to be part of regular merging. I suppose we could merge
> segments of different sizes if the combined size was < max segment
> size. On a quick glance it doesn't seem like the log merge policies
> address this kind of case either, but haven't dug into them much.
>
>
>
> TMP should be able to merge one max sized segment (that has eked just over
> 50% deleted docs) with smaller sized segments.  It would not prefer this
> merge, since merging substantially different segment sizes performs poorly
> vs. merging equally sized segments, but it does have a bias for
> removing deleted docs that would offset that.
>
>
> Thanks!
>
>
>
> You're welcome!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
>
>
>


