jackrabbit-oak-issues mailing list archives

From Michael Dürig (JIRA) <j...@apache.org>
Subject [jira] [Commented] (OAK-7852) Blocked background flush can cause severe data loss
Date Mon, 22 Oct 2018 10:12:00 GMT

    [ https://issues.apache.org/jira/browse/OAK-7852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658813#comment-16658813 ]

Michael Dürig commented on OAK-7852:

One approach to ensure segments are flushed at regular intervals even when the flush thread
fails is to piggyback the flushes on other write operations. There is some risk here of introducing
deadlocks and subtle race conditions, and of imposing alien exceptions on the thread hijacked for
the piggybacked flush. An advantage of this approach, however, is that the number of unflushed
segments would be in direct relation to the write rate.
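A minimal sketch of the piggybacking idea, assuming a hypothetical {{Flushable}} interface and a {{PiggybackFlusher}} class (both illustrative names, not actual Oak classes): a write operation triggers a flush only when the last successful flush is overdue, so flush frequency tracks the write rate.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: piggyback a flush on write operations whenever the
// last successful flush is older than a threshold. PiggybackFlusher and
// Flushable are illustrative names, not actual Oak classes.
public class PiggybackFlusher {
    interface Flushable {
        void flush() throws Exception;
    }

    private final Flushable store;
    private final long intervalMillis;
    private final AtomicLong lastFlush = new AtomicLong(System.currentTimeMillis());

    PiggybackFlusher(Flushable store, long intervalMillis) {
        this.store = store;
        this.intervalMillis = intervalMillis;
    }

    // Called from the write path: flush only if overdue, so the number of
    // unflushed segments stays in direct relation to the write rate. The CAS
    // ensures at most one concurrent writer performs the piggybacked flush.
    void maybeFlush() {
        long last = lastFlush.get();
        long now = System.currentTimeMillis();
        if (now - last >= intervalMillis && lastFlush.compareAndSet(last, now)) {
            try {
                store.flush();
            } catch (Exception e) {
                // An "alien" exception on the hijacked writer thread: log it
                // rather than failing the unrelated write operation.
                System.err.println("Piggybacked flush failed: " + e);
            }
        }
    }
}
```

Note the trade-off this makes explicit: the flush failure is swallowed on the writer thread, which is exactly the kind of alien-exception risk mentioned above.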

To mitigate these risks we could execute the actual flush on a separate, short-lived thread.
If we keep track of those threads in a dedicated class, this would allow us to closely monitor
failed / blocked flushes and take action, such as logging warnings or blocking repository writes.

See [https://github.com/mduerig/jackrabbit-oak/commit/52bba59b473776a36aa582a224a983920b191dd4]
for an initial implementation.
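A rough sketch of the short-lived flush thread with tracking, assuming a hypothetical {{FlushMonitor}} class (the linked commit has the actual initial implementation; this is only illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: run each flush on its own short-lived thread and keep
// counters of in-flight and failed flushes as a basis for logging warnings
// or blocking repository writes. FlushMonitor is an illustrative name.
public class FlushMonitor {
    interface Flushable {
        void flush() throws Exception;
    }

    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger failed = new AtomicInteger();

    // Start a dedicated thread for one flush; the caller's thread is never
    // hijacked, so flush failures cannot surface as alien exceptions there.
    Thread triggerFlush(Flushable store) {
        inFlight.incrementAndGet();
        Thread t = new Thread(() -> {
            try {
                store.flush();
            } catch (Exception e) {
                failed.incrementAndGet();
            } finally {
                inFlight.decrementAndGet();
            }
        }, "segment-flush");
        t.start();
        return t;
    }

    int inFlightCount() { return inFlight.get(); }
    int failedCount()   { return failed.get(); }
}
```

A monitoring component could then watch {{failedCount()}} and a growing {{inFlightCount()}} (blocked flushes) and react accordingly.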


A completely different approach would be to monitor the flush rate and block writes to the repository
once it drops below a certain threshold (and log a warning along with it). This should be
relatively easy to implement by wrapping {{WriteOperationHandler}}. This approach is probably
more in the area of OAK-7854. 
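The gating idea could look roughly like the following sketch, assuming a hypothetical {{FlushGate}} class consulted before each write (the real {{WriteOperationHandler}} interface in oak-segment-tar differs; this only illustrates the threshold check):

```java
// Hypothetical sketch of gating writes on a recent successful flush, in the
// spirit of wrapping WriteOperationHandler. FlushGate is an illustrative
// name; maxStaleMillis encodes the acceptable flush-rate threshold.
public class FlushGate {
    private final long maxStaleMillis;
    private volatile long lastFlushMillis;

    FlushGate(long maxStaleMillis) {
        this.maxStaleMillis = maxStaleMillis;
        this.lastFlushMillis = System.currentTimeMillis();
    }

    // Called by the flush path on every successful flush.
    void flushed() {
        lastFlushMillis = System.currentTimeMillis();
    }

    // Called before executing a write operation: reject the write (and let
    // the caller log a warning) once flushes have become too stale.
    void checkWriteAllowed() {
        long stale = System.currentTimeMillis() - lastFlushMillis;
        if (stale > maxStaleMillis) {
            throw new IllegalStateException(
                    "Rejecting write: no successful flush for " + stale + " ms");
        }
    }
}
```

Blocking writes this way bounds the amount of unflushed data at the cost of availability, which is why it fits the monitoring theme of OAK-7854.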


[~frm], WDYT is the best approach going forward here?

> Blocked background flush can cause severe data loss 
> ---------------------------------------------------
>                 Key: OAK-7852
>                 URL: https://issues.apache.org/jira/browse/OAK-7852
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: segment-tar
>            Reporter: Michael Dürig
>            Assignee: Michael Dürig
>            Priority: Major
>             Fix For: 1.10
> When the {{FileStore}} background task fails (e.g. because of a deadlock) and the {{FileStore}}
> is subsequently shut down in an unclean way ({{kill -9}}), there is a risk of severe data
> loss. Although a journal could be reconstructed from the segments, there is a chance that
> most if not all of the revisions written since the failure of the background tasks are inconsistent
> with a {{SNFE}}. 
> The expectation for such a case should be that a journal can be reconstructed from
> the segments and that all but the last few revisions are consistent.

This message was sent by Atlassian JIRA
