jackrabbit-oak-issues mailing list archives

From "Chetan Mehrotra (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (OAK-1341) DocumentNodeStore: Implement revision garbage collection
Date Mon, 31 Mar 2014 05:24:15 GMT

    [ https://issues.apache.org/jira/browse/OAK-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941849#comment-13941849 ]

Chetan Mehrotra edited comment on OAK-1341 at 3/31/14 5:24 AM:
---------------------------------------------------------------

Based on past discussion with [~mreutegg] and [~tmueller], the following areas need to be accounted
for by the GC logic

*Garbage Types*
Currently, revision garbage is created in the following areas

# Deleted documents - If a document is deleted, it is currently not removed from the persistent layer.
OAK-1557 helps here
# Split Documents - Documents are split as they grow in size. A split document can be one of the
following types. Further, all revision entries in a split doc are older than the revision
of the split doc
## SD1 - Document contains commit entries in {{_revision}} and all other property history.

## SD2 - Document only contains entries for various properties which have been updated over
time
## SD3 - Document contains both {{_revision}} and property entries but had no child when it
was split
## SD4 - Document is an intermediate document created as part of cascading split doc support
(OAK-1342)
# Primary Document old revision - Even if a document is not split, it might still contain old
revision entries for properties and commits

Of the above, #1 and #1.2, #1.3, #1.4 can be safely removed completely if their revisions are older.
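
The classification above can be sketched in Java roughly as follows. This is only an illustration: the enum, the class name, and the `cutoffMillis` parameter are hypothetical and not taken from the Oak code base. Reading "#1 and #1.2, #1.3, #1.4" as deleted documents plus SD2, SD3, and SD4 is an assumption about the list numbering.

```java
import java.util.EnumSet;
import java.util.Set;

final class GarbageTypes {

    /** Garbage categories from the list above; the enum itself is hypothetical. */
    enum GarbageType {
        DELETED_DOCUMENT,      // #1: document deleted but still present in the persistent layer
        SPLIT_SD1,             // commit entries in _revision plus all other property history
        SPLIT_SD2,             // property entries only
        SPLIT_SD3,             // _revision and property entries, no child at split time
        SPLIT_SD4,             // intermediate document from cascading split support
        PRIMARY_OLD_REVISION   // #3: old revision entries in a primary document
    }

    // ASSUMPTION: "#1 and #1.2, #1.3, #1.4" is read as deleted documents and SD2, SD3, SD4.
    static final Set<GarbageType> WHOLE_DOC_REMOVABLE = EnumSet.of(
            GarbageType.DELETED_DOCUMENT,
            GarbageType.SPLIT_SD2,
            GarbageType.SPLIT_SD3,
            GarbageType.SPLIT_SD4);

    /**
     * True if a document of the given type may be dropped as a whole:
     * its type is in the removable set and its newest revision is older than the cutoff.
     */
    static boolean canRemoveCompletely(GarbageType type,
                                       long newestRevisionMillis,
                                       long cutoffMillis) {
        return WHOLE_DOC_REMOVABLE.contains(type) && newestRevisionMillis < cutoffMillis;
    }
}
```

SD1 and primary documents are left out of the removable set because they may hold commit entries, which need entry-level handling rather than whole-document removal.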


*Deleting Garbage related to Commit records*
Deleting old commit records would be tricky, as it is hard to distinguish between a
failed/unmerged commit and an old commit.

Furthermore, the GC logic also has to honour any checkpoints registered with the NodeStore (OAK-1586).
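
One plausible way to honour checkpoints is to cap the GC cutoff at the oldest registered checkpoint, so that nothing a checkpoint may still need is collected. The sketch below assumes checkpoints can be represented as millisecond timestamps. `GcCutoff`, `effectiveCutoff`, and the parameters are hypothetical names, not the OAK-1586 API.

```java
import java.util.Collection;

final class GcCutoff {

    /**
     * Returns the timestamp before which revisions may be collected.
     * Starts from (now - maxAge) and moves the cutoff back to the oldest
     * checkpoint if that checkpoint is older.
     */
    static long effectiveCutoff(long nowMillis,
                                long maxAgeMillis,
                                Collection<Long> checkpointTimestamps) {
        long cutoff = nowMillis - maxAgeMillis;
        for (long checkpoint : checkpointTimestamps) {
            // ASSUMPTION: a checkpoint protects everything at or after its timestamp.
            cutoff = Math.min(cutoff, checkpoint);
        }
        return cutoff;
    }
}
```

With no checkpoints registered, the cutoff is simply the maximum-age boundary.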

Of the above, #1 and #1.2, #1.3, #1.4 can be safely removed completely if their revisions are older.



was (Author: chetanm):
Based on past discussion with [~mreutegg] and [~tmueller], the following areas need to be accounted
for by the GC logic

*Garbage Types*
Currently, revision garbage is created in the following areas

# Deleted documents - If a document is deleted, it is currently not removed from the persistent layer.
OAK-1557 helps here
# Split Documents - Documents are split as they grow in size. A split document can be one of the
following types. Further, all revision entries in a split doc are older than the revision
of the split doc
## SD1 - Document contains commit entries in {{_revision}} and all other property history.

## SD2 - Document only contains entries for various properties which have been updated over
time
## SD3 - Document contains both {{_revision}} and property entries but had no child when it
was split
## SD4 - Document is an intermediate document created as part of cascading split doc support
(OAK-1342)
# Primary Document old revision - Even if a document is not split, it might still contain old
revision entries for properties and commits

Of the above, #1 and #1.2, #1.3, #1.4 can be safely removed completely if their revisions are older.


*Deleting Garbage related to Commit records*
Deleting old commit records would be tricky, as it is hard to distinguish between a
failed/unmerged commit and an old commit.

Furthermore, the GC logic also has to honour any checkpoints registered with the NodeStore (OAK-1586).

So for now I would aim for #1 and #1.2

> DocumentNodeStore: Implement revision garbage collection
> --------------------------------------------------------
>
>                 Key: OAK-1341
>                 URL: https://issues.apache.org/jira/browse/OAK-1341
>             Project: Jackrabbit Oak
>          Issue Type: Sub-task
>          Components: mongomk
>            Reporter: Thomas Mueller
>            Assignee: Chetan Mehrotra
>            Priority: Minor
>             Fix For: 0.20
>
>
> For the MongoMK (as well as for other storage engines that are based on it), garbage
> collection is most easily implemented by iterating over all documents and removing unused
> entries (either whole documents, or data within the document).
> Iteration can be done in parallel (for example one process per shard), and it can be
> done in any order.
> The most efficient order is probably the id order; however, it might be better to iterate
> only over documents that were not changed recently, by using the index on the "_modified"
> property. That way we don't need to iterate over the whole repository over and over again,
> but just over those documents that were actually changed.
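
The idea of scanning by the "_modified" index instead of walking every document can be illustrated with an in-memory stand-in. In this rough sketch a `TreeMap` plays the role of the index; the class, its methods, and the example ids are hypothetical and do not use the MongoDB driver or Oak APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

final class ModifiedIndexScan {

    // Hypothetical stand-in for an index on "_modified": timestamp -> document ids.
    private final NavigableMap<Long, List<String>> byModified = new TreeMap<>();

    /** Registers a document id under its "_modified" timestamp. */
    void put(String id, long modified) {
        byModified.computeIfAbsent(modified, k -> new ArrayList<>()).add(id);
    }

    /**
     * Returns ids of documents whose "_modified" value is strictly older than
     * the cutoff, oldest first. Only the matching range of the index is
     * visited, not the whole collection.
     */
    List<String> candidatesOlderThan(long cutoff) {
        List<String> ids = new ArrayList<>();
        for (List<String> bucket : byModified.headMap(cutoff, false).values()) {
            ids.addAll(bucket);
        }
        return ids;
    }
}
```

Because the range query is independent of id order, separate ranges (or shards) could be handed to separate workers, in line with the note that iteration can run in parallel.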



--
This message was sent by Atlassian JIRA
(v6.2#6252)
