jackrabbit-oak-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chetan Mehrotra (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (OAK-3110) AsyncIndexer fails due to FileNotFoundException thrown by CopyOnWrite logic
Date Thu, 16 Jul 2015 05:06:05 GMT

    [ https://issues.apache.org/jira/browse/OAK-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629197#comment-14629197
] 

Chetan Mehrotra edited comment on OAK-3110 at 7/16/15 5:05 AM:
---------------------------------------------------------------

Attaching [logs|^copier.log] from IndexCopier and AsyncIndexUpdate for a single run where
the run failed due to exception in CopyOnWrite. The issue here has its root in OAK-2855

h3. Interaction between CopyOnWrite (COW) and CopyOnRead (COR)

COW logic is implemented with the requirement that it has to *share* the same local directory
as used in COR. This ensures that it can make use of already copied index files while indexing.
Any indexing sequence would work in following sequence

# COW - Index done - Index moved to R1
# COR - Open at R1
# COW - Index moves to R2
# COR - Picks up index at R2
# COW - Starts indexing for R3

Further COR takes a snapshot of current local directory content and when it closes it deletes
those files which are present in local directory but not present in remote directory. This
ensures that issues seen in OAK-2855 do not happen. This approach would have worked fine if
above sequence is strictly sequential. However in real world thats not the case as the 2 tasks
are performed via different threads. 

For e.g. say at #3 when COW is done with indexing the Lucene index update observor would detect
that index has updated. This observation processing would happen in a different thread. Now
when at #4 COR is opening the index it can happen that COW has also started with its index
at #5 and hence started creating new files. These new files would be picked by COR at #4 during
its opening phase and it would consider them part of old index. Now when COR would close it
would delete these files and that would then result in a FileNotFoundException for COW in
#5

h3. Mapping to Logs

In the attached logs following timeline is seen

{noformat}
14:20:01.700 *DEBUG* [pool-58-thread-1] AsyncIndexUpdate Running background index task async
14:20:08.601 *DEBUG* [pool-58-thread-1] IndexCopier [COW] Creating output _2s7.fdt
14:20:18.146 *DEBUG* [oak-lucene-0] IndexCopier [/oak:index/foxDamIndex] Following files have
been removed from Lucene index directory [_2s7.fdt, _2s7.fdx]
14:20:35.752 *DEBUG* [oak-lucene-0] IndexCopier Error occurred while copying files. Further
processing would be skipped
java.io.FileNotFoundException: _2s7.fdt
	at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:261)
14:20:35.930 *WARN* [pool-58-thread-1] AsyncIndexUpdate The async index update failed
{noformat}

So as can be seen that COW starts at 20:01 and creates _2s7.fdt at 20:08. However that file
got deleted by oak-lucene-0 (thread which opens index with COR) at 20:18 and then eventually
FileNotFoundException is thrown when copy logic in COW tries to read the file

h3. Fix

As a fix COW and COR would have to share the current working set and COR should not touch
files in that working set


was (Author: chetanm):
Attaching [logs|^copier.log] from IndexCopier and AsyncIndexUpdate for a single run where
the run failed due to exception in CopyOnWrite. The issue here has its root in OAK-2855

h3. Interaction between CopyOnWrite (COW) and CopyOnRead (COR)

COW logic is implemented with the requirement that it has sure the same local directory as
used in COR. This ensures that it can make use of alread copied index files while indexing.
Any indexing sequence would work in following sequence

# COW - Index done - Index moved to R1
# COR - Open at R1
# COW - Index moves to R2
# COR - Picks up index at R2
# COW - Starts indexing for R3

Further COR takes a snapshot of current local directory content and when it closes it deletes
those files which are present in local directory but not present in remote directory. This
ensures that issues seen in OAK-2855 do not happen. This approach would have worked fine if
above sequence is strictly sequential. However in real world thats not the case as the 2 task
are performed via different threads. 

For e.g. say at #3 when COW is done with indexing the Lucene index update observor would detect
that index has updated. This observation processing would happen in a different thread. Now
when at #4 COR is opening the index it can happen that COW has also started with its index
at #5 and hence started creating new files. These new files would be picked by COR at #4 during
its opening phase and it would consider them part of old index. Now when COR would close it
would delete these files and that would then result in a FileNotFoundException for COW in
#5

h3. Mapping to Logs

In the attached logs following timeline is seen

{noformat}
14:20:01.700 *DEBUG* [pool-58-thread-1] AsyncIndexUpdate Running background index task async
14:20:08.601 *DEBUG* [pool-58-thread-1] IndexCopier [COW] Creating output _2s7.fdt
14:20:18.146 *DEBUG* [oak-lucene-0] IndexCopier [/oak:index/foxDamIndex] Following files have
been removed from Lucene index directory [_2s7.fdt, _2s7.fdx]
14:20:35.752 *DEBUG* [oak-lucene-0] IndexCopier Error occurred while copying files. Further
processing would be skipped
java.io.FileNotFoundException: _2s7.fdt
	at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:261)
14:20:35.930 *WARN* [pool-58-thread-1] AsyncIndexUpdate The async index update failed
{noformat}

So as can be seen that COW starts at 20:01 and creates _2s7.fdt at 20:08. However that file
got deleted by oak-lucene-0 (thread which opens index with COR) at 20:18 and then eventually
FileNotFoundException is thrown when copy logic in COW tries to read the file

h3. Fix

As a fix COW and COR would have to share the current working set and COR should not touch
files in that working set

> AsyncIndexer fails due to FileNotFoundException thrown by CopyOnWrite logic
> ---------------------------------------------------------------------------
>
>                 Key: OAK-3110
>                 URL: https://issues.apache.org/jira/browse/OAK-3110
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: lucene
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>             Fix For: 1.2.3, 1.3.3, 1.0.18
>
>         Attachments: copier.log
>
>
> At times the CopyOnWrite reports following exception
> {noformat}
> 15.07.2015 14:20:35.930 *WARN* [pool-58-thread-1] org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate
The async index update failed
> org.apache.jackrabbit.oak.api.CommitFailedException: OakLucene0004: Failed to close the
Lucene index
> 	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditor.leave(LuceneIndexEditor.java:204)
> 	at org.apache.jackrabbit.oak.plugins.index.IndexUpdate.leave(IndexUpdate.java:219)
> 	at org.apache.jackrabbit.oak.spi.commit.VisibleEditor.leave(VisibleEditor.java:63)
> 	at org.apache.jackrabbit.oak.spi.commit.EditorDiff.process(EditorDiff.java:56)
> 	at org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.updateIndex(AsyncIndexUpdate.java:366)
> 	at org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate.run(AsyncIndexUpdate.java:311)
> 	at org.apache.sling.commons.scheduler.impl.QuartzJobExecutor.execute(QuartzJobExecutor.java:105)
> 	at org.quartz.core.JobRunShell.run(JobRunShell.java:207)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.FileNotFoundException: _2s7.fdt
> 	at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:261)
> 	at org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier$CopyOnWriteDirectory$COWLocalFileReference.fileLength(IndexCopier.java:837)
> 	at org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier$CopyOnWriteDirectory.fileLength(IndexCopier.java:607)
> 	at org.apache.lucene.index.SegmentCommitInfo.sizeInBytes(SegmentCommitInfo.java:141)
> 	at org.apache.lucene.index.DocumentsWriterPerThread.sealFlushedSegment(DocumentsWriterPerThread.java:529)
> 	at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:502)
> 	at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:508)
> 	at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:618)
> 	at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3147)
> 	at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3123)
> 	at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:988)
> 	at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:932)
> 	at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:894)
> 	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorContext.closeWriter(LuceneIndexEditorContext.java:192)
> 	at org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditor.leave(LuceneIndexEditor.java:202)
> 	... 10 common frames omitted
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message