jackrabbit-oak-issues mailing list archives

From "Stefan Eissing (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (OAK-4780) VersionGarbageCollector should be able to run incrementally
Date Wed, 08 Mar 2017 16:49:38 GMT

    [ https://issues.apache.org/jira/browse/OAK-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15901533#comment-15901533 ]

Stefan Eissing commented on OAK-4780:

Updated https://github.com/apache/jackrabbit-oak/compare/trunk...icing:revision-garbage-collector?expand=1
with some of the promised changes. Most notably, this patch now adds the command 'revisions'
to the oak-run suite.

{{java -jar target/oak-run-1.8-SNAPSHOT.jar revisions mongodb://host/dbname}}

Without further arguments, it just prints information about the last run, the recommended
parameters and the number of delete candidates found. This is the command {{info}}, which is
the default.

{{collect}} will run the real revision garbage collection. There are some options, for example
{{--once}} to only run a single iteration. The other command currently supported is {{reset}}
which clears all persisted meta information from rgc.

On a sample customer database with ~250 million nodes and ~25 million delete candidates overall,
a candidate query that does not select any node runs for about 10 minutes on my developer
machine. My initial algorithm to find the oldest time did finish, but took over an hour.
I wrote a MongoDB-specific query implementation that finds the same result in about 3 minutes.
Since this normally runs only once, that seems fine.

The collection is now running here with unlimited iterations. I will report back tomorrow how it went.

> VersionGarbageCollector should be able to run incrementally
> -----------------------------------------------------------
>                 Key: OAK-4780
>                 URL: https://issues.apache.org/jira/browse/OAK-4780
>             Project: Jackrabbit Oak
>          Issue Type: Task
>          Components: core, documentmk
>            Reporter: Julian Reschke
>         Attachments: leafnodes.diff, leafnodes-v2.diff, leafnodes-v3.diff
> Right now, the documentmk's version garbage collection runs in several phases.
> It first collects the paths of candidate nodes, and only once this has been successfully
> finished, starts actually deleting nodes.
> This can be a problem when the regularly scheduled garbage collection is interrupted
> during the path collection phase, maybe due to other maintenance tasks. On the next run, the
> number of paths to be collected will be even bigger, thus making it even more likely to fail.
> We should think about a change in the logic that would allow the GC to run in chunks;
> maybe by partitioning the path space by top level directory.
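
To make the idea of running the GC "in chunks" concrete, here is a minimal Java sketch. It is not taken from Oak or from the patch linked above. The class {{IncrementalGcSketch}}, the in-memory {{FakeStore}}, the {{checkpoint}} field and the choice to chunk by modification-time window (rather than by top-level path, as the issue description suggests) are all assumptions made for illustration. The point it shows is that each iteration collects and deletes one bounded slice, so an interrupted run loses at most one slice of work.

```java
import java.util.SortedMap;
import java.util.TreeMap;

/** Illustrative only: none of these names are Oak APIs or part of the patch. */
final class IncrementalGcSketch {

    /**
     * Hypothetical stand-in for the document store. Maps a last-modified
     * time (ms) to the path of a delete candidate. One path per timestamp
     * keeps the sketch short; a real store would allow many.
     */
    static final class FakeStore {
        final TreeMap<Long, String> candidates = new TreeMap<>();
        /** Assumed progress marker: everything older than this was collected. */
        long checkpoint = Long.MIN_VALUE;
    }

    /**
     * Runs a single iteration: deletes the candidates in the oldest time
     * window [oldest, oldest + windowMs), never going past the cutoff.
     * Returns the number of deleted candidates.
     */
    static int runOnce(FakeStore store, long windowMs, long cutoff) {
        if (windowMs <= 0) {
            throw new IllegalArgumentException("windowMs must be positive");
        }
        // Only candidates strictly older than the cutoff are eligible.
        if (store.candidates.headMap(cutoff).isEmpty()) {
            return 0;
        }
        long start = store.candidates.firstKey();
        long end = Math.min(start + windowMs, cutoff);
        // subMap is a live view: clearing it removes the entries from the store.
        SortedMap<Long, String> window = store.candidates.subMap(start, end);
        int deleted = window.size();
        window.clear();
        // Record progress after every chunk so a later run can report it.
        store.checkpoint = end;
        return deleted;
    }

    /** Repeats iterations until no candidate older than the cutoff is left. */
    static int runAll(FakeStore store, long windowMs, long cutoff) {
        int total = 0;
        while (!store.candidates.headMap(cutoff).isEmpty()) {
            total += runOnce(store, windowMs, cutoff);
        }
        return total;
    }
}
```

In this sketch, {{runOnce}} is one bounded iteration and {{runAll}} keeps iterating until nothing eligible remains. Because every eligible window starts at the oldest remaining candidate, each iteration deletes at least one entry, so the loop always terminates.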

This message was sent by Atlassian JIRA
