jackrabbit-oak-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chetan Mehrotra (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (OAK-3436) Prevent missing checkpoint due to unstable topology from causing complete reindexing
Date Thu, 24 Sep 2015 15:36:04 GMT

    [ https://issues.apache.org/jira/browse/OAK-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906506#comment-14906506
] 

Chetan Mehrotra commented on OAK-3436:
--------------------------------------

After being pointed out by Alex above I tried to recreate the scenario and so far not able
to get it going. The lease check does prevent concurrent execution. From customer logs following
can be made out

h3. Cluster Node N2

*run-n2-1*
{noformat}
16.09.2015 15:29:57.292 *WARN* [pool-5-thread-1] org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate
Failed to retrieve previously indexed checkpoint r14fd6c63cbf-0-3; re-running the initial
async index update
16.09.2015 15:29:57.314 *INFO* [pool-5-thread-1] org.apache.jackrabbit.oak.plugins.index.IndexUpdate
Found a new index node [authorizables]. Reindexing is requested
16.09.2015 15:30:14.009 *INFO* [pool-5-thread-1] org.apache.jackrabbit.oak.plugins.index.IndexUpdate
Found a new index node [counter]. Reindexing is requested

16.09.2015 15:39:29.262 *INFO* [pool-5-thread-1] org.apache.jackrabbit.oak.plugins.document.mongo.MongoDocumentStore
query timed out after 3000 milliseconds and will be retried without lock [4:/oak:index/uuid/:index/,
4:/oak:index/uuid/:index0, _modified, 1442417375, 2147483647]
16.09.2015 15:40:47.235 *INFO* [pool-5-thread-1] org.apache.jackrabbit.oak.plugins.document.mongo.MongoDocumentStore
query timed out after 3000 milliseconds and will be retried without lock [4:/oak:index/uuid/:index/,
4:/oak:index/uuid/:index0, _modified, 1442417375, 2147483647]
16.09.2015 15:41:59.617 *ERROR* [pool-5-thread-1] org.apache.sling.commons.scheduler.impl.QuartzScheduler
Exception during job execution of org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate@74561c24
: operation exceeded time limit
com.mongodb.MongoExecutionTimeoutException: operation exceeded time limit
	at com.mongodb.QueryResultIterator.throwOnQueryFailure(QueryResultIterator.java:244)
	at com.mongodb.QueryResultIterator.init(QueryResultIterator.java:224)
	at com.mongodb.QueryResultIterator.initFromQueryResponse(QueryResultIterator.java:184)
	at com.mongodb.QueryResultIterator.<init>(QueryResultIterator.java:62)
	at com.mongodb.DBCollectionImpl.find(DBCollectionImpl.java:86)
	at com.mongodb.DBCollectionImpl.find(DBCollectionImpl.java:66)
	at com.mongodb.DBCursor._check(DBCursor.java:498)
	at com.mongodb.DBCursor._hasNext(DBCursor.java:621)
	at com.mongodb.DBCursor.hasNext(DBCursor.java:657)
	at org.apache.jackrabbit.oak.plugins.document.mongo.MongoDocumentStore.queryInternal(MongoDocumentStore.java:647)
	at org.apache.jackrabbit.oak.plugins.document.mongo.MongoDocumentStore.query(MongoDocumentStore.java:578)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.diffManyChildren(DocumentNodeStore.java:2237)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.diffImpl(DocumentNodeStore.java:2205)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.access$400(DocumentNodeStore.java:122)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore$10.call(DocumentNodeStore.java:1395)
	at org.apache.jackrabbit.oak.plugins.document.MemoryDiffCache.getChanges(MemoryDiffCache.java:74)
	at org.apache.jackrabbit.oak.plugins.document.TieredDiffCache.getChanges(TieredDiffCache.java:50)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.compare(DocumentNodeStore.java:1390)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeState.compareAgainstBaseState(DocumentNodeState.java:343)
	at org.apache.jackrabbit.oak.plugins.document.CommitDiff.childNodeChanged(CommitDiff.java:96)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.dispatch(DocumentNodeStore.java:2091)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.compare(DocumentNodeStore.java:1390)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeState.compareAgainstBaseState(DocumentNodeState.java:343)
	at org.apache.jackrabbit.oak.plugins.document.CommitDiff.childNodeChanged(CommitDiff.java:96)
	at org.apache.jackrabbit.oak.plugins.memory.ModifiedNodeState$1.childNodeChanged(ModifiedNodeState.java:429)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.dispatch(DocumentNodeStore.java:2091)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.compare(DocumentNodeStore.java:1390)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeState.compareAgainstBaseState(DocumentNodeState.java:343)
	at org.apache.jackrabbit.oak.plugins.memory.ModifiedNodeState.compareAgainstBaseState(ModifiedNodeState.java:405)
	at org.apache.jackrabbit.oak.plugins.document.CommitDiff.childNodeChanged(CommitDiff.java:96)
	at org.apache.jackrabbit.oak.plugins.memory.ModifiedNodeState.compareAgainstBaseState(ModifiedNodeState.java:399)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreBranch$1.with(DocumentNodeStoreBranch.java:302)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreBranch.persist(DocumentNodeStoreBranch.java:323)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreBranch.persist(DocumentNodeStoreBranch.java:299)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreBranch.access$300(DocumentNodeStoreBranch.java:52)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreBranch$Persisted.persistTransientHead(DocumentNodeStoreBranch.java:679)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreBranch$Persisted.access$000(DocumentNodeStoreBranch.java:573)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreBranch$BranchState.persist(DocumentNodeStoreBranch.java:390)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreBranch$InMemory.setRoot(DocumentNodeStoreBranch.java:519)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreBranch.setRoot(DocumentNodeStoreBranch.java:163)
	at org.apache.jackrabbit.oak.plugins.document.DocumentRootBuilder.purge(DocumentRootBuilder.java:199)
	at org.apache.jackrabbit.oak.plugins.document.DocumentRootBuilder.updated(DocumentRootBuilder.java:99)
	at org.apache.jackrabbit.oak.plugins.memory.MemoryNodeBuilder.updated(MemoryNodeBuilder.java:214)
	at org.apache.jackrabbit.oak.plugins.memory.MemoryNodeBuilder.remove(MemoryNodeBuilder.java:355)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeBuilder.removeInternal(DocumentNodeBuilder.java:123)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeBuilder.removeRecursive(DocumentNodeBuilder.java:130)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeBuilder.removeRecursive(DocumentNodeBuilder.java:128)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeBuilder.removeRecursive(DocumentNodeBuilder.java:128)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeBuilder.removeRecursive(DocumentNodeBuilder.java:128)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeBuilder.removeRecursive(DocumentNodeBuilder.java:128)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeBuilder.removeRecursive(DocumentNodeBuilder.java:128)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeBuilder.remove(DocumentNodeBuilder.java:117)
	at org.apache.jackrabbit.oak.plugins.index.IndexUpdate.collectIndexEditors(IndexUpdate.java:187)
	at org.apache.jackrabbit.oak.plugins.index.IndexUpdate.enter(IndexUpdate.java:114)
	at org.apache.jackrabbit.oak.spi.commit.VisibleEditor.enter(VisibleEditor.java:57)
	at org.apache.jackrabbit.oak.spi.commit.EditorDiff.process(EditorDiff.java:49)
	at org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.updateIndex(AsyncIndexUpdate.java:367)
	at org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.run(AsyncIndexUpdate.java:312)
	at org.apache.sling.commons.scheduler.impl.QuartzJobExecutor.execute(QuartzJobExecutor.java:105)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:722)
{noformat}

h3. Cluster Node N1

There are two occurances
*run-n1-1*
{noformat}
16.09.2015 15:30:01.982 *WARN* [pool-5-thread-2] org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate
Failed to retrieve previously indexed checkpoint r14fd6c6782f-0-3; re-running the initial
async index update
16.09.2015 15:30:12.121 *WARN* [pool-5-thread-2] org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate
The async index update failed
org.apache.jackrabbit.oak.api.CommitFailedException: OakMerge0001: OakMerge0001: Failed to
merge changes to the underlying store (retries 5, 4482 ms)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreBranch.merge0(DocumentNodeStoreBranch.java:239)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreBranch.merge(DocumentNodeStoreBranch.java:179)
	at org.apache.jackrabbit.oak.plugins.document.DocumentRootBuilder.merge(DocumentRootBuilder.java:158)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore.merge(DocumentNodeStore.java:1495)
	at org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.mergeWithConcurrencyCheck(AsyncIndexUpdate.java:450)
	at org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.access$400(AsyncIndexUpdate.java:72)
	at org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate$AsyncUpdateCallback.<init>(AsyncIndexUpdate.java:189)
	at org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.updateIndex(AsyncIndexUpdate.java:359)
	at org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.run(AsyncIndexUpdate.java:312)
	at org.apache.sling.commons.scheduler.impl.QuartzJobExecutor.execute(QuartzJobExecutor.java:105)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.jackrabbit.oak.plugins.document.DocumentStoreException: The node 1:/:async
was changed in revision
r14fd6c6ce3c-0-2, which was applied after the base revision
r14fd6c6d12d-1-3, before
r14fd6c70814-0-3
	at org.apache.jackrabbit.oak.plugins.document.Commit.checkConflicts(Commit.java:557)
	at org.apache.jackrabbit.oak.plugins.document.Commit.createOrUpdateNode(Commit.java:471)
	at org.apache.jackrabbit.oak.plugins.document.Commit.applyToDocumentStore(Commit.java:368)
	at org.apache.jackrabbit.oak.plugins.document.Commit.applyToDocumentStore(Commit.java:262)
	at org.apache.jackrabbit.oak.plugins.document.Commit.applyInternal(Commit.java:231)
	at org.apache.jackrabbit.oak.plugins.document.Commit.apply(Commit.java:216)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreBranch.persist(DocumentNodeStoreBranch.java:329)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreBranch.persist(DocumentNodeStoreBranch.java:299)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreBranch.access$300(DocumentNodeStoreBranch.java:52)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreBranch$InMemory.merge(DocumentNodeStoreBranch.java:545)
	at org.apache.jackrabbit.oak.plugins.document.DocumentNodeStoreBranch.merge0(DocumentNodeStoreBranch.java:223)
	... 13 common frames omitted
{noformat}

*run-n1-2*
{noformat}
16.09.2015 15:42:01.977 *WARN* [pool-5-thread-2] org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate
Failed to retrieve previously indexed checkpoint r14fd6c6782f-0-3; re-running the initial
async index update
16.09.2015 15:42:01.984 *INFO* [pool-5-thread-2] org.apache.jackrabbit.oak.plugins.index.IndexUpdate
Found a new index node [authorizables]. Reindexing is requested
16.09.2015 15:42:01.999 *INFO* [pool-5-thread-2] org.apache.jackrabbit.oak.plugins.index.IndexUpdate
Found a new index node [counter]. Reindexing is requested
16.09.2015 15:45:52.717 *INFO* [pool-5-thread-2] org.apache.jackrabbit.oak.plugins.index.IndexUpdate
Reindexing will be performed for following indexes: [/oak:index/counter, /oak:index/damAssetLucene,
/oak:index/authorizables, /oak:index/lucene, /oak:index/lucenecqLastRolledoutBy, /oak:index/cqPageLucene,
/oak:index/ntBaseLucene, /oak:index/cqTagLucene, /oak:index/workflowDataLucene, /oak:index/imageReferenceLucene]
{noformat}

h3. Observation

# For some reason on N2 fails to retrieve the checkpoint r14fd6c63cbf-0-3, 2015-09-16T15:29:20.063
UTC (log time 16.09.2015 15:29:57.292) and it proceeds to start reindexing
{noformat}
16.09.2015 15:29:57.292 *WARN* [pool-5-thread-1] org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate
Failed to retrieve previously indexed checkpoint r14fd6c63cbf-0-3; re-running the initial
async index update
{noformat}
# In few moments later on N1 also system runs async index and fails to find the checkpoint
r14fd6c6782f-0-3, 2015-09-16T15:29:35.279 UTC, (log time 16.09.2015 15:30:01.982) and attempts
the index. 
{noformat}
16.09.2015 15:30:01.982 *WARN* [pool-5-thread-2] org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate
Failed to retrieve previously indexed checkpoint r14fd6c6782f-0-3; re-running the initial
async index update
{noformat}
# On N1 the run fails with commit exception due to concurrent update
{noformat}
Caused by: org.apache.jackrabbit.oak.plugins.document.DocumentStoreException: The node 1:/:async
was changed in revision
r14fd6c6ce3c-0-2, which was applied after the base revision
r14fd6c6d12d-1-3, before
r14fd6c70814-0-3
{noformat}
# At some time later run on N2 gives up with exception in Mongo call
{noformat}
16.09.2015 15:40:47.235 *INFO* [pool-5-thread-1] org.apache.jackrabbit.oak.plugins.document.mongo.MongoDocumentStore
query timed out after 3000 milliseconds and will be retried without lock [4:/oak:index/uuid/:index/,
4:/oak:index/uuid/:index0, _modified, 1442417375, 2147483647]
16.09.2015 15:41:59.617 *ERROR* [pool-5-thread-1] org.apache.sling.commons.scheduler.impl.QuartzScheduler
Exception during job execution of org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate@74561c24
: operation exceeded time limit
com.mongodb.MongoExecutionTimeoutException: operation exceeded time limit
{noformat}
# Then after few moments on N1 run starts again and this time it proceeds but with reindexing
{noformat}
16.09.2015 15:42:01.977 *WARN* [pool-5-thread-2] org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate
Failed to retrieve previously indexed checkpoint r14fd6c6782f-0-3; re-running the initial
async index update
...
16.09.2015 15:45:52.717 *INFO* [pool-5-thread-2] org.apache.jackrabbit.oak.plugins.index.IndexUpdate
Reindexing will be performed for following indexes: [/oak:index/counter, /oak:index/damAssetLucene,
/oak:index/authorizables, /oak:index/lucene, /oak:index/lucenecqLastRolledoutBy, /oak:index/cqPageLucene,
/oak:index/ntBaseLucene, /oak:index/cqTagLucene, /oak:index/workflowDataLucene, /oak:index/imageReferenceLucene]
{noformat}

h3. Comments/Open Questions

So it can be seen that concurrent run get handled and fail early itself. However following
open question remain
# Why did the checkpoint got missing. Note this was detected by both cluster nodes
# When the run on N2 failed it seems to be performing query on oak:/index/uuid/:index which
is strange as that path should not be touched. Do note that here we are dealing with a persisted
branch as index

> Prevent missing checkpoint due to unstable topology from causing complete reindexing
> ------------------------------------------------------------------------------------
>
>                 Key: OAK-3436
>                 URL: https://issues.apache.org/jira/browse/OAK-3436
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: query
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.3.7, 1.2.7, 1.0.22
>
>
> Async indexing logic relies on embedding application to ensure that async indexing job
is run as a singleton in a cluster. For Sling based apps it depends on Sling Discovery support.
At times it is being seen that if topology is not stable then different cluster nodes can
consider them as leader and execute the async indexing job concurrently.
> This can cause problem as both cluster node might not see same repository state (due
to write skew and eventual consistency) and might remove the checkpoint which other cluster
node is still relying upon. For e.g. consider a 2 node cluster N1 and N2 where both are performing
async indexing.
> # Base state - CP1 is the checkpoint for "async" job
> # N2 starts indexing and removes changes CP1 to CP2. For Mongo the checkpoints are saved
in {{settings}} collection
> # N1 also decides to execute indexing but has yet not seen the latest repository state
so still thinks that CP1 is the base checkpoint and tries to read it. However CP1 is already
removed from {{settings}} and this makes N1 think that checkpoint is missing and it decides
to reindex everything!
> To avoid this topology must be stable but at Oak level we should still handle such a
case and avoid doing a full reindexing. So we would need to have a {{MissingCheckpointStrategy}}
similar to {{MissingIndexEditorStrategy}} as done in OAK-2203 
> Possible approaches
> # A1 - Fail the indexing run if checkpoint is missing - Checkpoint being missing can
have valid reason and invalid reason. Need to see what are valid scenarios where a checkpoint
can go missing
> # A2 - When a checkpoint is created also store the creation time. When a checkpoint is
found to be missing and its a *recent* checkpoint then fail the run. For e.g. we would fail
the run till checkpoint found to be missing is less than an hour old (for just started take
startup time into account)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message