jackrabbit-oak-issues mailing list archives

From "Chetan Mehrotra (JIRA)" <j...@apache.org>
Subject [jira] [Created] (OAK-3436) Prevent missing checkpoint due to unstable topology from causing complete reindexing
Date Tue, 22 Sep 2015 04:50:04 GMT
Chetan Mehrotra created OAK-3436:
------------------------------------

             Summary: Prevent missing checkpoint due to unstable topology from causing complete reindexing
                 Key: OAK-3436
                 URL: https://issues.apache.org/jira/browse/OAK-3436
             Project: Jackrabbit Oak
          Issue Type: Improvement
          Components: query
            Reporter: Chetan Mehrotra
            Assignee: Chetan Mehrotra
             Fix For: 1.3.6, 1.2.7, 1.0.22


The async indexing logic relies on the embedding application to ensure that the async indexing
job runs as a singleton in a cluster. For Sling-based apps this depends on Sling Discovery
support. It has been observed that when the topology is not stable, different cluster nodes can
each consider themselves the leader and execute the async indexing job concurrently.

This can cause problems, as the cluster nodes might not see the same repository state (due to
write skew and eventual consistency), and one node might remove a checkpoint that another node
is still relying upon. For example, consider a 2-node cluster, N1 and N2, where both nodes are
performing async indexing.

# Base state - CP1 is the checkpoint for the "async" job
# N2 starts indexing and changes the checkpoint from CP1 to CP2, removing CP1. For Mongo the
checkpoints are saved in the {{settings}} collection
# N1 also decides to execute indexing but has not yet seen the latest repository state, so it
still thinks that CP1 is the base checkpoint and tries to read it. However, CP1 has already
been removed from {{settings}}, which makes N1 think that the checkpoint is missing, and it
decides to reindex everything!
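
The sequence above can be illustrated with a small in-memory simulation. This is only a sketch: {{CheckpointRaceDemo}} and {{CheckpointStore}} are hypothetical names invented for illustration, not Oak classes, and the stale view of N1 is modelled simply as a remembered checkpoint name.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: these types are hypothetical and not part of Oak. */
final class CheckpointRaceDemo {

    /** Stand-in for the shared checkpoint storage (the settings collection on Mongo). */
    static final class CheckpointStore {
        private final Set<String> live = new HashSet<>();

        void create(String checkpoint) { live.add(checkpoint); }
        void release(String checkpoint) { live.remove(checkpoint); }
        boolean exists(String checkpoint) { return live.contains(checkpoint); }
    }

    /** Replays steps 1-3 and reports whether N1 would conclude that its checkpoint is missing. */
    static boolean staleNodeSeesMissingCheckpoint() {
        CheckpointStore store = new CheckpointStore();

        // Step 1: base state, CP1 is the checkpoint for the "async" job.
        store.create("CP1");
        String referenceSeenByN1 = "CP1"; // N1's (soon to be stale) view

        // Step 2: N2 runs the indexing job, moves to CP2 and removes CP1.
        store.create("CP2");
        store.release("CP1");

        // Step 3: N1 has not yet seen the update and still looks up CP1.
        return !store.exists(referenceSeenByN1);
    }
}
```

In this simulation {{staleNodeSeesMissingCheckpoint()}} returns true, which corresponds to the point where N1 would fall back to a full reindex.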

To avoid this, the topology must be stable, but at the Oak level we should still handle such
a case and avoid doing a full reindex. So we would need a {{MissingCheckpointStrategy}},
similar to {{MissingIndexEditorStrategy}} as done in OAK-2203.

Possible approaches
# A1 - Fail the indexing run if the checkpoint is missing - A checkpoint can be missing for
valid or invalid reasons. We need to determine the valid scenarios in which a checkpoint can
go missing
# A2 - When a checkpoint is created, also store its creation time. When a checkpoint is found
to be missing and it is a *recent* checkpoint, fail the run. For example, we would fail the
run as long as the missing checkpoint is less than an hour old (for a node that has just
started, take the startup time into account)
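
A minimal sketch of how A2 could look, assuming the creation time of the checkpoint is stored next to the checkpoint reference. The class name, the {{Decision}} enum and the method signature are hypothetical and do not exist in Oak; the startup time adjustment mentioned above is not modelled here.

```java
/** Hypothetical sketch of approach A2; not an existing Oak API. */
final class RecentCheckpointStrategy {

    /** What the indexing run should do when its reference checkpoint cannot be found. */
    enum Decision { FAIL_RUN, REINDEX }

    private final long recentWindowMillis;

    /** @param recentWindowMillis age below which a missing checkpoint counts as recent */
    RecentCheckpointStrategy(long recentWindowMillis) {
        this.recentWindowMillis = recentWindowMillis;
    }

    /**
     * @param checkpointCreatedMillis creation time recorded when the checkpoint was made
     * @param nowMillis               current time
     */
    Decision onMissingCheckpoint(long checkpointCreatedMillis, long nowMillis) {
        long ageMillis = nowMillis - checkpointCreatedMillis;
        // A recent checkpoint that has vanished is more likely the result of a
        // concurrent run on another node than a legitimate loss, so fail this run.
        return ageMillis < recentWindowMillis ? Decision.FAIL_RUN : Decision.REINDEX;
    }
}
```

With a window of one hour (3600000 ms), a checkpoint found missing ten minutes after creation would fail the run, while one found missing after two hours would still lead to a reindex. A1 corresponds to the degenerate case of always returning {{FAIL_RUN}}.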




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
