jackrabbit-oak-issues mailing list archives

From "Chetan Mehrotra (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (OAK-3436) Prevent missing checkpoint due to unstable topology from causing complete reindexing
Date Mon, 28 Sep 2015 09:17:04 GMT

    [ https://issues.apache.org/jira/browse/OAK-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14920059#comment-14920059 ]

Chetan Mehrotra commented on OAK-3436:

bq. Let me see if I got this right

You got it perfectly right! A picture is worth a thousand words ...

bq. my guess is the lease is expiring because in some cases the traversal is too long

In theory yes, that is one way this scenario can happen. The lease is updated only after
15 minutes have elapsed *and* 100 callbacks have been received. In async indexing, closing
the index can be costly because Lucene might decide to run compaction, and in that case a
long time can elapse.
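
To illustrate why a slow index close can let the lease lapse, here is a minimal sketch of a lease that is renewed only from indexing callbacks. The class {{LeaseTracker}}, its method names, and the exact renewal rule (check on every 100th callback whether 15 minutes have passed) are assumptions made for illustration, not Oak's actual code. Only the 15 minute and 100 callback figures come from the comment above.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: a lease that is refreshed only from indexing callbacks. */
class LeaseTracker {
    // Figures taken from the comment; everything else is an assumption
    static final long LEASE_INTERVAL_MS = TimeUnit.MINUTES.toMillis(15);
    static final int CALLBACKS_PER_CHECK = 100;

    private long lastRenewedAt;
    private long callbacks;

    LeaseTracker(long now) {
        this.lastRenewedAt = now;
    }

    /** Called once per indexed change; returns true if the lease was renewed. */
    boolean onCallback(long now) {
        callbacks++;
        // Both conditions must hold: enough callbacks and enough elapsed time
        if (callbacks % CALLBACKS_PER_CHECK == 0
                && now - lastRenewedAt >= LEASE_INTERVAL_MS) {
            lastRenewedAt = now;
            return true;
        }
        return false;
    }

    /** Another node would treat the lease as expired after timeoutMs without renewal. */
    boolean isExpired(long now, long timeoutMs) {
        return now - lastRenewedAt > timeoutMs;
    }
}
```

In this model no renewal can happen while no callbacks arrive, so a long-running close (which produces no callbacks) leaves {{lastRenewedAt}} untouched and the lease can expire from the other node's point of view.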

bq. One thing I don't understand is why doesn't N2 fail to acquire the lease once N1 starts
indexing? There's a moment when N2 times-out, N1 starts indexing, but at T3 when N2 comes
back it should fail the commit, no?

Depending on timing, there is a small race condition. When N1 detects that N2 has timed out,
it proceeds to update :async, but before that it performs the temporary checkpoint cleanup.
It can happen that after N1 has done the checkpoint cleanup, but before it has done the
:async update, N2 comes back and updates the lease again. This causes N1 to fail while
still letting N2 complete.
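
The interleaving can be replayed in a single thread. The sketch below is a simplified model, not Oak code: {{LeaseRaceDemo}}, {{tryTakeLease}}, the checkpoint name {{cp-n2-temp}} and the timestamps are all hypothetical, and the repository state is reduced to a set of checkpoints plus one lease.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, single-threaded replay of the interleaving described above. */
class LeaseRaceDemo {
    // Shared repository state, greatly simplified
    final Set<String> checkpoints = new HashSet<>();
    String leaseOwner;
    long leaseExpiresAt;

    /** Conditional update: fails if another node holds an unexpired lease. */
    boolean tryTakeLease(String node, long now, long duration) {
        if (leaseOwner != null && !leaseOwner.equals(node) && now < leaseExpiresAt) {
            return false;
        }
        leaseOwner = node;
        leaseExpiresAt = now + duration;
        return true;
    }

    /** Returns true if N2 ends up holding the lease while its checkpoint is gone. */
    static boolean replay() {
        LeaseRaceDemo repo = new LeaseRaceDemo();
        repo.checkpoints.add("cp-n2-temp");   // temp checkpoint N2 is indexing against
        repo.tryTakeLease("N2", 0, 10);       // N2 holds the lease until t=10

        // t=11: N1 sees the lease as expired and cleans up checkpoints first
        boolean n2TimedOut = 11 >= repo.leaseExpiresAt;
        if (n2TimedOut) {
            repo.checkpoints.remove("cp-n2-temp");
        }
        // t=12: N2 comes back and renews its lease before N1 updates :async
        repo.tryTakeLease("N2", 12, 10);
        // t=13: N1's :async update now fails
        boolean n1GotLease = repo.tryTakeLease("N1", 13, 10);

        return !n1GotLease
                && "N2".equals(repo.leaseOwner)
                && !repo.checkpoints.contains("cp-n2-temp");
    }
}
```

In this model the damage is done by the ordering alone: the cleanup is not guarded by the same conditional update as the lease, so it takes effect even though N1's run is rejected afterwards.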

bq. Also if there are blackout intervals where the lease expires and the other node is taking
over, then this means the 2 nodes will always be competing for indexing

Yes, that can very well happen.

bq. would increasing the lease timeout help mitigate this issue

At least for this scenario, this should help.

bq. I'm not against this option, but I'd like to clarify the lease stuff first, if possible.

Would try to get more details ... but as it is intermittent getting more details would be

> Prevent missing checkpoint due to unstable topology from causing complete reindexing
> ------------------------------------------------------------------------------------
>                 Key: OAK-3436
>                 URL: https://issues.apache.org/jira/browse/OAK-3436
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: query
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.3.8, 1.2.7, 1.0.22
>         Attachments: AsyncIndexUpdateClusterTest.java, OAK-3436-0.patch
> Async indexing logic relies on the embedding application to ensure that the async indexing
job runs as a singleton in a cluster. For Sling based apps it depends on Sling Discovery support.
At times it has been seen that if the topology is not stable, different cluster nodes can
each consider themselves the leader and execute the async indexing job concurrently.
> This can cause problems, as the cluster nodes might not see the same repository state (due
to write skew and eventual consistency) and one might remove a checkpoint which another cluster
node is still relying upon. For example, consider a 2 node cluster N1 and N2 where both are
performing async indexing.
> # Base state - CP1 is the checkpoint for "async" job
> # N2 starts indexing, moves the checkpoint from CP1 to CP2 and removes CP1. For Mongo the
checkpoints are saved in the {{settings}} collection
> # N1 also decides to execute indexing but has not yet seen the latest repository state,
so it still thinks that CP1 is the base checkpoint and tries to read it. However, CP1 has already
been removed from {{settings}}, which makes N1 think that the checkpoint is missing, and it
decides to reindex everything!
> To avoid this, the topology must be stable, but at the Oak level we should still handle such
a case and avoid doing a full reindexing. So we would need to have a {{MissingCheckpointStrategy}}
similar to the {{MissingIndexEditorStrategy}} done in OAK-2203
> Possible approaches
> # A1 - Fail the indexing run if the checkpoint is missing - A checkpoint can be missing
for valid or invalid reasons. Need to see what the valid scenarios are in which a checkpoint
can go missing
> # A2 - When a checkpoint is created, also store the creation time. When a checkpoint is
found to be missing and it is a *recent* checkpoint, then fail the run. For example, we would
fail the run as long as the missing checkpoint is less than an hour old (for a node that has
just started, take the startup time into account)
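
A rough sketch of how approach A2 might look, purely as an illustration: {{RecentCheckpointGuard}}, its {{Decision}} values and the idea of keeping the creation time next to the checkpoint reference are assumptions, not the contents of the attached patch. The one hour window comes from the description above; the startup-time adjustment is left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of approach A2: fail the run when a recent checkpoint is missing. */
class RecentCheckpointGuard {
    enum Decision { CONTINUE, FAIL_RUN, REINDEX }

    // "less than an hour old" from the issue description
    static final long RECENT_WINDOW_MS = TimeUnit.HOURS.toMillis(1);

    // Live checkpoints and their creation times (stand-in for the checkpoint store)
    private final Map<String, Long> live = new HashMap<>();

    void create(String checkpoint, long createdAt) {
        live.put(checkpoint, createdAt);
    }

    void release(String checkpoint) {
        live.remove(checkpoint);
    }

    /**
     * Assumes the indexer stored the creation time alongside the checkpoint
     * reference, so it is still known when the checkpoint itself is gone.
     */
    Decision decide(String referenced, long referencedCreatedAt, long now) {
        if (live.containsKey(referenced)) {
            return Decision.CONTINUE;   // checkpoint present, index incrementally
        }
        if (now - referencedCreatedAt < RECENT_WINDOW_MS) {
            return Decision.FAIL_RUN;   // recent checkpoint missing: likely a competing run
        }
        return Decision.REINDEX;        // old checkpoint missing: fall back to reindexing
    }
}
```

With this rule, N1 in the example above would fail its run and retry later instead of reindexing, since CP1 went missing shortly after it was created.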

This message was sent by Atlassian JIRA
