jackrabbit-oak-issues mailing list archives

From "Chetan Mehrotra (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (OAK-1341) DocumentNodeStore: Implement revision garbage collection
Date Thu, 20 Mar 2014 15:38:44 GMT

    [ https://issues.apache.org/jira/browse/OAK-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941849#comment-13941849 ]

Chetan Mehrotra commented on OAK-1341:

Based on past discussions with [~mreutegg] and [~tmueller], the following areas need to be accounted
for by the GC logic.

*Garbage Types*
Currently, revision garbage is created in the following areas:

# Deleted documents - If a document is deleted, it is currently not removed from the persistence layer.
OAK-1557 helps here.
# Split Documents - Documents are split as they grow in size. A split document can be of one of the
following types. Further, all revision entries in a split doc are older than the revision
of the split doc.
## SD1 - Document only contains commit entries in {{_revision}}. 
## SD2 - Document only contains entries for various properties that have been updated over time
## SD3 - Document contains both {{_revision}} and property entries
# Primary Document old revisions - Even if a document is not split, it might still contain old
revision entries for properties and commits.
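
To make the categories above concrete, here is a minimal Java sketch that maps a document onto one of the listed garbage types. The class name, the enum constants, and the four boolean flags are hypothetical and chosen only for illustration; they are not Oak's actual API or document model.

```java
/** Hypothetical model of the garbage categories; names are illustrative, not Oak's API. */
final class GarbageTypeSketch {

    enum GarbageType {
        DELETED_DOCUMENT,           // item 1: deleted but still present in the store
        SD1_COMMITS_ONLY,           // split doc with only commit entries
        SD2_PROPERTIES_ONLY,        // split doc with only property entries
        SD3_COMMITS_AND_PROPERTIES, // split doc with both kinds of entries
        PRIMARY_OLD_REVISIONS       // item 3: unsplit doc carrying old entries
    }

    /** Classifies a document that is assumed to carry old revision data. */
    static GarbageType classify(boolean deleted, boolean split,
                                boolean hasCommitEntries, boolean hasPropertyEntries) {
        if (deleted) {
            return GarbageType.DELETED_DOCUMENT;
        }
        if (!split) {
            return GarbageType.PRIMARY_OLD_REVISIONS;
        }
        if (hasCommitEntries && hasPropertyEntries) {
            return GarbageType.SD3_COMMITS_AND_PROPERTIES;
        }
        if (hasCommitEntries) {
            return GarbageType.SD1_COMMITS_ONLY;
        }
        if (hasPropertyEntries) {
            return GarbageType.SD2_PROPERTIES_ONLY;
        }
        throw new IllegalArgumentException("split document without any entries");
    }

    public static void main(String[] args) {
        // A split document holding only property entries.
        System.out.println(classify(false, true, false, true)); // prints SD2_PROPERTIES_ONLY
    }
}
```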

Of the above, #1 and #1.2 can be safely removed completely if their revisions are older.

*Deleting Garbage related to Commit records*
Deleting old commit records would be tricky, as it is hard to distinguish between a
failed/unmerged commit and an old commit.

Further, the GC logic also has to honour any checkpoints registered with the NodeStore (OAK-1586).
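
One possible way to honour checkpoints is to never collect anything at or after the oldest registered checkpoint. The Java sketch below illustrates that idea under stated assumptions: revisions and checkpoints are reduced to plain millisecond timestamps, and the class and method names are hypothetical, not taken from Oak.

```java
import java.util.Arrays;
import java.util.Collection;

/** Hypothetical sketch: revisions and checkpoints are modelled as plain timestamps. */
final class GcCutoffSketch {

    /**
     * Returns the timestamp before which entries may be collected: the age-based
     * cutoff, lowered to the oldest checkpoint if one is older than that cutoff.
     */
    static long effectiveCutoff(long nowMillis, long maxAgeMillis,
                                Collection<Long> checkpointMillis) {
        long cutoff = nowMillis - maxAgeMillis;
        for (long checkpoint : checkpointMillis) {
            cutoff = Math.min(cutoff, checkpoint);
        }
        return cutoff;
    }

    /** An entry is collectable only if it is strictly older than the cutoff. */
    static boolean isCollectable(long revisionMillis, long cutoff) {
        return revisionMillis < cutoff;
    }

    public static void main(String[] args) {
        long cutoff = effectiveCutoff(10_000L, 1_000L, Arrays.asList(5_000L, 7_000L));
        System.out.println(cutoff);                        // prints 5000
        System.out.println(isCollectable(4_999L, cutoff)); // prints true
        System.out.println(isCollectable(6_000L, cutoff)); // prints false
    }
}
```

With no checkpoints registered, the cutoff is simply the current time minus the maximum age.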

So for now I would aim for #1 and #1.2.

> DocumentNodeStore: Implement revision garbage collection
> --------------------------------------------------------
>                 Key: OAK-1341
>                 URL: https://issues.apache.org/jira/browse/OAK-1341
>             Project: Jackrabbit Oak
>          Issue Type: Sub-task
>          Components: mongomk
>            Reporter: Thomas Mueller
>            Assignee: Chetan Mehrotra
>            Priority: Minor
>             Fix For: 0.20
> For the MongoMK (as well as for other storage engines that are based on it), garbage
> collection is most easily implemented by iterating over all documents and removing unused
> entries (either whole documents, or data within the document).
> Iteration can be done in parallel (for example one process per shard), and it can be
> done in any order.
> The most efficient order is probably the id order; however, it might be better to iterate
> only over documents that were not changed recently, by using the index on the "_modified"
> property. That way we don't need to iterate over the whole repository over and over again,
> but just over those documents that were actually changed.
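
The scan over the "_modified" index described in the issue can be sketched in Java as a range query. This is an in-memory illustration only: a `TreeMap` stands in for the index, and the class name and the `lastScanned` and `cutoff` parameters are assumptions made for the example, not Oak or MongoDB APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical sketch: a TreeMap stands in for the index on "_modified". */
final class ModifiedIndexScanSketch {

    /**
     * Returns ids of documents modified in [lastScanned, cutoff): changed since the
     * previous GC pass, but not changed recently.
     */
    static List<String> candidates(NavigableMap<Long, List<String>> modifiedIndex,
                                   long lastScanned, long cutoff) {
        List<String> ids = new ArrayList<>();
        if (lastScanned >= cutoff) {
            return ids; // empty range; also avoids subMap rejecting from > to
        }
        for (List<String> bucket : modifiedIndex.subMap(lastScanned, true, cutoff, false).values()) {
            ids.addAll(bucket);
        }
        return ids;
    }

    public static void main(String[] args) {
        NavigableMap<Long, List<String>> index = new TreeMap<>();
        index.put(100L, Arrays.asList("a"));
        index.put(200L, Arrays.asList("b", "c"));
        index.put(300L, Arrays.asList("d"));
        System.out.println(candidates(index, 150L, 300L)); // prints [b, c]
    }
}
```

Remembering the upper bound of one pass and using it as `lastScanned` for the next is what would keep a pass from revisiting the whole repository.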

This message was sent by Atlassian JIRA
