jackrabbit-oak-issues mailing list archives

From "Chetan Mehrotra (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (OAK-3436) Prevent missing checkpoint due to unstable topology from causing complete reindexing
Date Fri, 25 Sep 2015 10:11:05 GMT

     [ https://issues.apache.org/jira/browse/OAK-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chetan Mehrotra updated OAK-3436:
    Attachment: OAK-3436-0.patch

[testcase patch|^OAK-3436-0.patch] for reproducing the scenario.

Had to modify {{AsyncIndexUpdate}} to allow configuring the lease timeout easily. What is
happening is this:

{{AsyncIndexUpdate}} maintains an array of temp checkpoints, which are released before indexing
is performed. On each indexing cycle, the newly created checkpoint is pushed to this array.

Now assume you have 2 cluster nodes (N1 and N2), each running {{AsyncIndexUpdate}}.

# Base state - Indexing done and current checkpoint stored in {{:async}} is CP1
# Async indexer starts on N2 and creates a new checkpoint CP2. At time T1 it updates the
tmp checkpoint array with CP2 and proceeds with indexing. Indexing is not yet complete
# At this stage the async indexer starts on N1 and sees the base checkpoint as CP1, with CP2
in the tmp checkpoint array. For some reason the default 15 min lease time has also expired,
so at time T2 this run proceeds further and removes the tmp checkpoint CP2
# Now async indexer on N2 completes and releases CP1 (base checkpoint) and updates the {{:async}}
node with lease status
# Now async indexer on N1 also proceeds, but its commit fails due to a concurrent update on {{:async}}
# Now at some point async indexer on N2 starts again and looks for CP2, but it has already
been removed!
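
The interleaving above can be replayed with a small in-memory model. This is only an illustrative sketch: the class {{CheckpointRaceSketch}}, the set standing in for the checkpoint store and the list standing in for the tmp checkpoint array are all hypothetical and are not Oak APIs. The failed commit on N1 (step 5) is left out because it does not change which checkpoints exist.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CheckpointRaceSketch {

    /** Replays steps 1 to 6 and reports whether the base checkpoint still exists. */
    static boolean replay() {
        // Hypothetical stand-ins: live checkpoints, tmp checkpoint array, :async base
        Set<String> store = new HashSet<>(Arrays.asList("CP1"));
        List<String> temp = new ArrayList<>();
        String base = "CP1";                 // step 1: base state

        store.add("CP2");                    // step 2: N2 creates CP2 ...
        temp.add("CP2");                     // ... and records it as a tmp checkpoint

        store.removeAll(temp);               // step 3: N1 (lease expired) releases
        temp.clear();                        //         all tmp checkpoints, CP2 included

        store.remove(base);                  // step 4: N2 completes, releases CP1 ...
        base = "CP2";                        // ... and stores CP2 in :async

        return store.contains(base);         // step 6: next N2 run looks up CP2
    }

    public static void main(String[] args) {
        System.out.println("base checkpoint present: " + replay());
    }
}
```

In this model {{replay()}} returns false: the checkpoint recorded in {{:async}} no longer exists when the next run looks for it.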

If we move the cleanup of tmp checkpoints to a finally clause in the run method, then the current
working checkpoint would not get lost.
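
A rough sketch of that idea follows. It is not the attached patch and not the actual {{AsyncIndexUpdate}} code: {{CheckpointStore}}, {{AsyncState}}, {{Indexer}} and {{run}} are hypothetical stand-ins, and the sketch assumes that each run only ever releases a checkpoint it created itself or the one it has just superseded.

```java
import java.util.HashSet;
import java.util.Set;

class FinallyCleanupSketch {

    /** Hypothetical in-memory checkpoint store (not the Oak NodeStore API). */
    static class CheckpointStore {
        private final Set<String> live = new HashSet<>();
        private int counter;

        String create() {
            String cp = "CP" + (++counter);
            live.add(cp);
            return cp;
        }

        void release(String cp) { live.remove(cp); }

        boolean exists(String cp) { return live.contains(cp); }
    }

    /** Hypothetical stand-in for the checkpoint reference kept in :async. */
    static class AsyncState {
        String base;
    }

    /** Hypothetical indexing step; may fail, e.g. on a commit conflict. */
    interface Indexer {
        void index(String from, String to) throws Exception;
    }

    static boolean run(CheckpointStore store, AsyncState async, Indexer indexer) {
        String before = async.base;
        String after = store.create();
        // Until the run succeeds, the only checkpoint this run may drop is its own.
        String toRelease = after;
        try {
            indexer.index(before, after);
            async.base = after;      // new base is recorded
            toRelease = before;      // old base is now safe to drop
            return true;
        } catch (Exception e) {
            return false;            // base untouched, own checkpoint dropped below
        } finally {
            if (toRelease != null) {
                store.release(toRelease);
            }
        }
    }

    public static void main(String[] args) {
        CheckpointStore store = new CheckpointStore();
        AsyncState async = new AsyncState();
        async.base = store.create();
        run(store, async, (from, to) -> { });
        System.out.println("base=" + async.base + " exists=" + store.exists(async.base));
    }
}
```

With cleanup confined to the finally block, a run that starts on another node no longer releases checkpoints at startup, so it cannot remove a checkpoint that a concurrent run is still working with.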

[~alex.parvulescu] [~amitj_76] Thoughts?

> Prevent missing checkpoint due to unstable topology from causing complete reindexing
> ------------------------------------------------------------------------------------
>                 Key: OAK-3436
>                 URL: https://issues.apache.org/jira/browse/OAK-3436
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: query
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.3.7, 1.2.7, 1.0.22
>         Attachments: AsyncIndexUpdateClusterTest.java, OAK-3436-0.patch
> Async indexing logic relies on the embedding application to ensure that the async indexing job
is run as a singleton in a cluster. For Sling based apps it depends on Sling Discovery support.
At times it has been seen that if the topology is not stable, different cluster nodes can
each consider themselves the leader and execute the async indexing job concurrently.
> This can cause problems, as the cluster nodes might not see the same repository state (due
to write skew and eventual consistency) and one might remove a checkpoint which the other cluster
node is still relying upon. For example, consider a 2 node cluster N1 and N2 where both are performing
async indexing.
> # Base state - CP1 is the checkpoint for "async" job
> # N2 starts indexing and changes the checkpoint from CP1 to CP2, removing CP1. For Mongo the
checkpoints are saved in the {{settings}} collection
> # N1 also decides to execute indexing but has not yet seen the latest repository state,
so it still thinks that CP1 is the base checkpoint and tries to read it. However, CP1 is already
removed from {{settings}}, and this makes N1 think that the checkpoint is missing, so it decides
to reindex everything!
> To avoid this the topology must be stable, but at the Oak level we should still handle such a
case and avoid doing a full reindexing. So we would need to have a {{MissingCheckpointStrategy}}
similar to {{MissingIndexEditorStrategy}} as done in OAK-2203
> Possible approaches
> # A1 - Fail the indexing run if the checkpoint is missing. A checkpoint can go missing for
valid or invalid reasons. Need to see what the valid scenarios are in which a checkpoint
can go missing
> # A2 - When a checkpoint is created, also store the creation time. When a checkpoint is
found to be missing and it is a *recent* checkpoint, then fail the run. For example, we would fail
the run as long as the missing checkpoint is less than an hour old (for a just started instance,
take startup time into account)
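
Approach A2 could look roughly like the sketch below. Everything in it is an assumption made for illustration: the class {{MissingCheckpointSketch}}, the {{Decision}} enum and the {{onMissingCheckpoint}} method are hypothetical and are not the proposed {{MissingCheckpointStrategy}} or any existing Oak API. The one hour window is the example value from the description, and the startup time adjustment is not modelled.

```java
import java.time.Duration;
import java.time.Instant;

class MissingCheckpointSketch {

    /** Possible reactions when the base checkpoint cannot be found. */
    enum Decision { FAIL_RUN, REINDEX }

    /**
     * Decides what to do about a missing checkpoint based on its stored
     * creation time: a recent checkpoint going missing is treated as suspicious
     * and the run fails, an older one falls back to reindexing.
     */
    static Decision onMissingCheckpoint(Instant createdAt, Instant now, Duration recentWindow) {
        Duration age = Duration.between(createdAt, now);
        return age.compareTo(recentWindow) < 0 ? Decision.FAIL_RUN : Decision.REINDEX;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Duration window = Duration.ofHours(1); // example value from the description
        System.out.println(onMissingCheckpoint(now.minus(Duration.ofMinutes(10)), now, window));
        System.out.println(onMissingCheckpoint(now.minus(Duration.ofHours(3)), now, window));
    }
}
```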

This message was sent by Atlassian JIRA
