jackrabbit-oak-issues mailing list archives

From "Stefan Egli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (OAK-2739) take appropriate action when lease cannot be renewed (in time)
Date Tue, 28 Jul 2015 15:51:04 GMT

    [ https://issues.apache.org/jira/browse/OAK-2739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14644567#comment-14644567 ]

Stefan Egli commented on OAK-2739:

Two possible strategies could be followed here:

h3. Detect and React
* Detection: the {{BackgroundLeaseUpdate}} thread finds out that, in the middle of operation
(i.e. not at startup), the lease does not exist although it expected it to still exist, since
it created the lease itself just one interval ago. So detecting this situation is easy.
* Reaction: reacting, however, is difficult and likely impossible. In theory an arbitrarily long
time can have passed between the lease timeout and the moment this gets detected. And during this
time nothing prevents the {{DocumentNodeStore}} from writing new data to the
documents. For most of the data written in this phase that is not much of a problem. But
data that is _topology dependent_ (e.g. dependent on the instance being a leader)
can result in _duplicate leader situations_, which would not be resolvable after the fact.

h3. Prevent
An alternative approach would be to prevent such a situation entirely. An instance would only
ever modify the {{DocumentStore}} while its lease is still valid.
* Now this cannot be made dependent on the persisted lease state, as the lease update thread
could again be blocked/prevented from updating the lease etc.
* But perhaps a more robust and simpler approach would be to run an internal countdown timer
that is restarted upon every lease renewal and *allow modifying requests to the DocumentStore
only when this timer has not yet hit zero*. This could be done with e.g. half of the lease time,
or with any time that has a reasonable margin compared to the lease update and lease timeout values.

IMO we should go the 'prevent' way with an explicit lease check before each document modification.
(This check would therefore have to be implemented very efficiently, but that should
be a no-brainer.)
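
To illustrate the 'prevent' idea, here is a rough sketch of such a countdown-based check. It is only a sketch under assumptions: {{LeaseGuard}}, its method names and the injected clock are hypothetical and not part of Oak, and the margin passed in would be e.g. half of the lease time. The check itself is one atomic read plus one clock read, so it should be cheap enough to run before every modification.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not an Oak class): allows modifications only while a
 * local countdown, restarted on every successful lease renewal, has not expired.
 */
final class LeaseGuard {

    private final long marginNanos;
    private final LongSupplier nanoClock;   // assumption: monotonic, e.g. System::nanoTime
    private final AtomicLong deadlineNanos;

    LeaseGuard(long marginMillis, LongSupplier nanoClock) {
        this.marginNanos = marginMillis * 1_000_000L;
        this.nanoClock = nanoClock;
        // The countdown starts when the guard is created (lease just acquired).
        this.deadlineNanos = new AtomicLong(nanoClock.getAsLong() + marginNanos);
    }

    /** To be called by the lease update thread after each successful renewal. */
    void leaseRenewed() {
        deadlineNanos.set(nanoClock.getAsLong() + marginNanos);
    }

    /** True while the countdown has not yet hit zero. */
    boolean isLeaseValid() {
        // Subtraction form stays correct even if the nano clock wraps around.
        return nanoClock.getAsLong() - deadlineNanos.get() < 0;
    }

    /** To be called before each modifying request to the store. */
    void checkLease() {
        if (!isLeaseValid()) {
            throw new IllegalStateException(
                    "lease not renewed in time, refusing to modify the store");
        }
    }
}
```

In production the guard would be created with {{System::nanoTime}} as clock; a wrapper around the store would call {{checkLease()}} at the start of every modifying method. Note that the check does not depend on the persisted lease state, only on the local clock, so a blocked lease update thread simply lets the countdown run out.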

I'll follow up on this idea and will come up with a patch.

/cc [~mreutegg], [~chetanm], [~reschke], wdyt? 

> take appropriate action when lease cannot be renewed (in time)
> --------------------------------------------------------------
>                 Key: OAK-2739
>                 URL: https://issues.apache.org/jira/browse/OAK-2739
>             Project: Jackrabbit Oak
>          Issue Type: Task
>          Components: mongomk
>    Affects Versions: 1.2
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>              Labels: resilience
>             Fix For: 1.3.5
> Currently, in an oak-cluster when (e.g.) one oak-client stops renewing its lease (ClusterNodeInfo.renewLease()),
this will be eventually noticed by the others in the same oak-cluster. Those then mark this
client as {{inactive}} and start recovering and subsequently removing that node from any
further merge etc operation.
> Now, whatever the reason was why that client stopped renewing the lease (could be an
exception, deadlock, whatever) - that client itself still considers itself as {{active}} and
continues to take part in the cluster action.
> This will result in an unbalanced situation where that one client 'sees' everybody as
{{active}} while the others see this one as {{inactive}}.
> If this ClusterNodeInfo state should be something that can be built upon, and to avoid
any inconsistency due to unbalanced handling, the inactive node should probably retire gracefully
- or any other appropriate action should be taken, other than just continuing as today.
> This ticket is to keep track of ideas and actions taken wrt this.

This message was sent by Atlassian JIRA
