jackrabbit-oak-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Dürig (JIRA) <j...@apache.org>
Subject [jira] [Commented] (OAK-3348) Cross gc sessions might introduce references to pre-compacted segments
Date Fri, 11 Mar 2016 13:22:56 GMT

    [ https://issues.apache.org/jira/browse/OAK-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15190912#comment-15190912

Michael Dürig commented on OAK-3348:

After much research I think I finally came up with a way to get rid of references to pre
compacted segments. Please have a look at my respective [GitHub branch|https://github.com/mduerig/jackrabbit-oak/tree/OAK-3348.poc].

There is a *HUGE* amount of changes. Around 10 kloc as [patch|https://github.com/apache/jackrabbit-oak/compare/461f113c7bf7bec3fa3626cca84db6619e5fc372...c269193ee4b824b62c2d03e9f097b3a0dc0e61df.patch].

I structured the changes into various commits relating similar concerns as summarised in the
individual commit messages. Most of the time I resisted unrelated refactorings as not to
complicate things further. Instead I just left [FIXME michid|https://github.com/mduerig/jackrabbit-oak/commit/c269193ee4b824b62c2d03e9f097b3a0dc0e61df]
tags into the code. However, sometimes I just had to refactor stuff in order to get a
clearer vision. Also I probably broke some things not immediately concerned (like
upgrade, backup, restore, standby, upgrade and tooling). I left the same tags in the code
those places I knew would need following up.

The core of this patch comes from the realisation that the compaction map is serving three
different concerns: deduplication, equality (of node states across different GC generations)
tracking GC generations. Furthermore for equality and GC generations the compaction map
served as an external structure that had to be kept in sync with the records and segments
elsewhere. This resulted in overly complex coordination and synchronisation of global state,
which was nearly impossible to get right. In my patch I completely separate those three concerns
allowing me to get rid of the compaction map entirely.

* The GC generation is now written into the segment header and segments can only contain records
  pertaining to that generation. The current generation is then simply the generation of the
  segment containing the current head node state. As a result incrementing the generation
is now
  implicit with {{SegmentStore.setHead}} and thus free of races. It happens when the result
  the {{Compactor}} is applied.

* Equality of segment node states is now id based: each {{SegmentNodeState}} contains an id.
  Equality of the id is a necessary condition for equality of those node states. Again, as
this id
  is persisted along with the states, we don't have to rely on and maintain external structures
  (like the compaction map) that are prone to races.

* As a direct consequence of not having to rely on an external map for equality, we can tweak
  the map we need for deduplication: there is no need for fully persistent keys anymore and
  can use a cache instead. This has the advantage that we can specifically teak the cache
  this single use case and adapt it to the available memory. That is, trading off the
  efficiency of deduplication. Deduplication is done in the {{SegmentWriter}} where I added
  a special cache for node states that gives precedence to nodes high up in the tree.

The patch contains quite a number of [FIXME michid|https://github.com/mduerig/jackrabbit-oak/commit/c269193ee4b824b62c2d03e9f097b3a0dc0e61df]
tags for things that need fixing/evaluating/analysing/improving. Most prominently:

* We need a good mechanism for creating ids for segment node states. Currently I use a counter,
  which I consider good enough for the POC but not fit for production.

* What is the effect of those additional ids on repository size?

* Having to read the GC generation from the segment potentially adds some overhead as we now
  to map the segment into memory where before it was sufficient to look at its storage address.
  Maybe we ca write the generation into the tar file index!?

* Validate and optimise the segment node state deduplication cache. What sizing works? Do
  also need to take the number of child node into account or is the position in the hierarchy
  sufficient? Do we need pinning (for checkpoints)?

* Do we need a cache for de-duplicating binaries? We didn't have this so far expect for in
  the compactor. As the latter now relies on deduplication done in the segment writer this
  something we should look into. OTOH, I improved deduplication for large binaries by rewriting
  the list of its block ids when copying a segment stream. As we should recommend using a
  store for working with large binaries this could actually be sufficient.

* I particularly dislike the way the deduplication cache is handed from the compactor to the
  segment writer. We should refactor this part into a clean solution. Also cache eviction
  currently handled from outside (for all record caches actually). It would be cleaner to
  this from inside the caches themselves.

* We need good monitoring for these caches.

Further down the line we also need to think about upgrading existing repositories as obviously
there are changes to the segment format here.

> Cross gc sessions might introduce references to pre-compacted segments
> ----------------------------------------------------------------------
>                 Key: OAK-3348
>                 URL: https://issues.apache.org/jira/browse/OAK-3348
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: segmentmk
>            Reporter: Michael Dürig
>            Assignee: Michael Dürig
>              Labels: candidate_oak_1_0, candidate_oak_1_2, cleanup, compaction, gc
>             Fix For: 1.6
>         Attachments: OAK-3348-1.patch, OAK-3348-2.patch, OAK-3348.patch, cross-gc-refs.pdf,
> I suspect that certain write operations during compaction can cause references from compacted
segments to pre-compacted ones. This would effectively prevent the pre-compacted segments
from getting evicted in subsequent cleanup phases. 
> The scenario is as follows:
> * A session is opened and a lot of content is written to it such that the update limit
is exceeded. This causes the changes to be written to disk. 
> * Revision gc runs causing a new, compacted root node state to be written to disk.
> * The session saves its changes. This causes rebasing of its changes onto the current
root (the compacted one). At this point any node that has been added will be added again in
the sub-tree rooted at the current root. Such nodes however might have been written to disk
*before* revision gc ran and might thus be contained in pre-compacted segments. As I suspect
the node-add operation in the rebasing process *not* to create a deep copy of such nodes but
to rather create a *reference* to them, a reference to a pre-compacted segment is introduced
> Going forward we need to validate above hypothesis, assess its impact if necessary come
up with a solution.

This message was sent by Atlassian JIRA

View raw message