jackrabbit-oak-issues mailing list archives

From "Julian Reschke (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally
Date Tue, 21 Feb 2017 12:01:44 GMT

    [ https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15875870#comment-15875870 ]

Julian Reschke commented on OAK-4780:

Here's an approach that might be simpler but in the end achieves the same goal:

- set a limit for the collection phase, both for elapsed time and number of documents
- when the limit is reached, sort the collected IDs by modified date, and compute a new upper
  limit so that half of the documents fall out of range; throw these entries away as well
- continue the collection with the smaller time window (this just needs an internal API that
  allows specifying the _id to start with)
- compute a new limit for elapsed time (half of the original?)

Eventually, we should have a set of documents that we *can* garbage collect.

Finally, if the maintenance window is still open, just run the GC again.
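
To make the steps above concrete, here is a rough sketch in plain Java. It is not based on the actual VersionGarbageCollector code: the names IncrementalGcSketch, Doc, Result and collect are made up for illustration, an in-memory list sorted by _id stands in for the document store, and only the document-count limit is modelled (an elapsed-time limit would be checked at the same place and halved each time it trips).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class IncrementalGcSketch {

    /** Hypothetical stand-in for a GC candidate: its _id and last-modified time. */
    static final class Doc {
        final String id;
        final long modified;

        Doc(String id, long modified) {
            this.id = id;
            this.modified = modified;
        }
    }

    /** The candidates that survived, plus the time window that was finally used. */
    static final class Result {
        final List<Doc> collected;
        final long upperLimit;

        Result(List<Doc> collected, long upperLimit) {
            this.collected = collected;
            this.upperLimit = upperLimit;
        }
    }

    /**
     * Scans candidates in _id order. Whenever maxDocs candidates are held, the
     * time window is narrowed to the median modified time and everything at or
     * above the new upper limit is dropped; the scan then simply continues
     * from the current _id with the smaller window.
     */
    static Result collect(List<Doc> docsInIdOrder, long initialUpperLimit, int maxDocs) {
        if (maxDocs < 1) {
            throw new IllegalArgumentException("maxDocs must be >= 1");
        }
        long upperLimit = initialUpperLimit;
        List<Doc> collected = new ArrayList<>();
        for (Doc doc : docsInIdOrder) {
            if (doc.modified >= upperLimit) {
                continue; // outside the current time window
            }
            collected.add(doc);
            if (collected.size() >= maxDocs) {
                // Limit reached: sort by modified date and cut at the median.
                collected.sort(Comparator.comparingLong((Doc d) -> d.modified));
                final long cut = collected.get(collected.size() / 2).modified;
                collected.removeIf(d -> d.modified >= cut);
                upperLimit = cut;
            }
        }
        return new Result(collected, upperLimit);
    }
}
```

In this sketch every document that remains is older than the returned upper limit, and no document older than that limit was skipped, so the result is a complete set for the narrowed window. Whatever fell out of range would be picked up by the next run.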

> VersionGarbageCollector should be able to run incrementally
> -----------------------------------------------------------
>                 Key: OAK-4780
>                 URL: https://issues.apache.org/jira/browse/OAK-4780
>             Project: Jackrabbit Oak
>          Issue Type: Task
>          Components: core, documentmk
>            Reporter: Julian Reschke
>         Attachments: leafnodes.diff, leafnodes-v2.diff, leafnodes-v3.diff
> Right now, the documentmk's version garbage collection runs in several phases.
> It first collects the paths of candidate nodes, and only once this has been successfully
> finished, starts actually deleting nodes.
> This can be a problem when the regularly scheduled garbage collection is interrupted
> during the path collection phase, maybe due to other maintenance tasks. On the next run, the
> number of paths to be collected will be even bigger, thus making it even more likely to fail.
> We should think about a change in the logic that would allow the GC to run in chunks;
> maybe by partitioning the path space by top level directory.

This message was sent by Atlassian JIRA