jackrabbit-oak-issues mailing list archives

From "Alex Parvulescu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (OAK-3362) Estimate compaction based on diff to previous compacted head state
Date Fri, 26 Feb 2016 15:11:18 GMT

    [ https://issues.apache.org/jira/browse/OAK-3362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169164#comment-15169164 ]

Alex Parvulescu commented on OAK-3362:

An update on the GC Deltas, and a new issue I ran into.

* GC Deltas: as it turns out, if you treat the checkpoints as snapshots in time between the
gc reference and the current head, ordered by creation time, you no longer need
incremental diffs between all revisions. You can just diff by intervals and you'll get a close
enough estimation of garbage.
To further explain the point: given [ref, cp0, cp1, head], where _ref_ is the revision
where compaction last ran, _cp0_ and _cp1_ are the removed checkpoints (we're effectively
ignoring added cps), and _head_ is the current head state, if you only diff [ref, head] you
can miss some intermediary updates on the same path (think indexing). As it turns out,
a much better estimation of garbage is simply splitting the large diff over intervals: diff[ref,
cp0] + diff[cp0, cp1] + diff[cp1, head]. It is still an estimation, but I think it is good.

* The issue that comes up next is what happens when the _ref_ state represents a compaction
run that was not efficient, meaning there's still garbage left (lots of in-memory refs that
can't be cleared and such). In this case the delta will only estimate garbage _since_ that
revision, so it may underestimate the total garbage in the store. I can't yet tell if this will
be a problem in real life or not.
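
To make the interval idea from the first point concrete, here is a minimal Python sketch. It is not Oak code: the state model (a dict from path to a record id and its size), the function names `diff_garbage` and `estimate_by_intervals`, and all paths and sizes are assumptions made up for illustration. It only shows why summing diffs over [ref, cp0, cp1, head] picks up intermediate updates that a single diff[ref, head] misses.

```python
from typing import Dict, List, Tuple

# Hypothetical toy model: a repository state maps each path to
# (record id, record size in bytes). Real Oak node states are far richer.
Record = Tuple[str, int]
State = Dict[str, Record]


def diff_garbage(before: State, after: State) -> int:
    """Sum the sizes of records in `before` that `after` replaced or removed."""
    garbage = 0
    for path, (record_id, size) in before.items():
        current = after.get(path)
        if current is None or current[0] != record_id:
            garbage += size
    return garbage


def estimate_by_intervals(states: List[State]) -> int:
    """Sum the diffs between consecutive states, e.g. [ref, cp0, cp1, head]."""
    return sum(diff_garbage(a, b) for a, b in zip(states, states[1:]))


# An index path is rewritten at every step; a content path changes once.
ref = {"/content/a": ("r1", 100), "/oak:index/x": ("i1", 50)}
cp0 = {"/content/a": ("r1", 100), "/oak:index/x": ("i2", 60)}
cp1 = {"/content/a": ("r2", 120), "/oak:index/x": ("i3", 70)}
head = {"/content/a": ("r2", 120), "/oak:index/x": ("i4", 80)}

direct = diff_garbage(ref, head)                       # only sees r1 and i1
by_intervals = estimate_by_intervals([ref, cp0, cp1, head])

print("diff[ref, head]:", direct)
print("sum over intervals:", by_intervals)
```

In this toy data the single diff counts only the two records that were live at _ref_, while the interval sum also counts the index records that existed only in the removed checkpoints. Neither number says anything about garbage that was already present at _ref_, which is the limitation described in the second point.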

> Estimate compaction based on diff to previous compacted head state
> ------------------------------------------------------------------
>                 Key: OAK-3362
>                 URL: https://issues.apache.org/jira/browse/OAK-3362
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: segmentmk
>            Reporter: Alex Parvulescu
>            Assignee: Alex Parvulescu
>            Priority: Minor
>              Labels: compaction, gc
>             Fix For: 1.6
> Food for thought: try to base the compaction estimation on a diff between the latest
> compacted state and the current state.
> Pros
> * estimation duration would be proportional to number of changes on the current head
> * using the size on disk as a reference, we could actually stop the estimation early
> when we go over the gc threshold.
> * data collected during this diff could in theory be passed as input to the compactor
> so it could focus on compacting a specific subtree
> Cons
> * need to keep a reference to a previous compacted state. post-startup and pre-compaction
> this might prove difficult (except maybe if we only persist the revision similar to what the
> async indexer is doing currently)
> * coming up with a threshold for running compaction might prove difficult
> * diff might be costly, but still cheaper than the current full diff
