jackrabbit-oak-issues mailing list archives

From "Stefan Egli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (OAK-2739) take appropriate action when lease cannot be renewed (in time)
Date Wed, 29 Jul 2015 07:26:04 GMT

    [ https://issues.apache.org/jira/browse/OAK-2739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645620#comment-14645620 ]

Stefan Egli commented on OAK-2739:

{quote}Currently background lease is updated periodically (every 1 sec) by a dedicated thread
which just perform a single operation and not much. So even if there are issues in other parts
this thread would continue to work (which might be wrong) and still update the lease every
1 sec.{quote}
That's not entirely correct: the lease is updated only every {{leaseTime / 2}} - by default
every 30 sec. The thread _checks_ every 1 sec but will only _update_ the lease every 30 sec.
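
To illustrate the check-vs-update distinction, here is a minimal sketch. It is not Oak's actual code: the class {{LeaseRenewalSketch}}, its methods and the injected clock are hypothetical names, and the 60 sec lease time used in the note below is only inferred from {{leaseTime / 2}} being 30 sec.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch, not Oak's implementation: the background thread wakes
 * up often (e.g. every second) but only renews once half the lease is used up.
 */
final class LeaseRenewalSketch {
    private final long leaseTimeMillis;
    private final LongSupplier clock;   // injected so the logic can be tested without sleeping
    private long leaseEndMillis;
    private int renewals;

    LeaseRenewalSketch(long leaseTimeMillis, LongSupplier clock) {
        this.leaseTimeMillis = leaseTimeMillis;
        this.clock = clock;
        this.leaseEndMillis = clock.getAsLong() + leaseTimeMillis;
    }

    /** Called on every wake-up; returns true only if the lease was renewed. */
    boolean checkAndMaybeRenew() {
        long now = clock.getAsLong();
        // More than half of the lease time is still left: nothing to do yet.
        if (leaseEndMillis - now > leaseTimeMillis / 2) {
            return false;
        }
        // A real implementation would persist the new lease end here.
        leaseEndMillis = now + leaseTimeMillis;
        renewals++;
        return true;
    }

    int renewals() {
        return renewals;
    }

    long leaseEndMillis() {
        return leaseEndMillis;
    }
}
```

With a 60 sec lease and one call per second, this renews at 30 sec, 60 sec, 90 sec and so on, while every other call returns without writing anything.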

{quote}So to me lease update does not look like an operation which would take long time and
cause above mentioned issues. May be I am missing something here{quote}
Generally speaking there are many reasons why one particular instance would not update a lease
in time:
# because it crashed
# because it can't talk to mongo anymore
# because the process was halted and continued (e.g. open/close laptop, process started in fg
and suspended via Ctrl-Z, kill -STOP, other terminal surprises)
# because the memory is very low, thus very long GC cycles, preventing much from happening
in the VM
# because the {{BackgroundLeaseUpdate}} task for some reason died
# because the {{BackgroundLeaseUpdate}} task for some reason is halted or runs into a deadlock
# because of something I forgot or some other yet to find-out VM mystery

In any case, the other instances have no way of figuring out the exact reason and they can
only assume that reason 1 (a crash) happened. And if that's not the case, then this ticket is
about finding a way that prevents the instance from continuing should it not be able to update
the lease within e.g. 30 sec. I think it's fair to demand that an instance is always capable of
updating the lease every 30 sec and if it can't, then it shall remain silent once and for all.
I'm not saying it is a situation that is likely to occur very frequently - but if we're to
build a reliable system then this part imv is a critical part of it.
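
As a rough sketch of what "remain silent once and for all" could look like (an assumption for discussion, not a proposal for the actual implementation: {{LeaseGuardSketch}}, {{renewLease()}} and {{checkLease()}} are made-up names, not existing Oak API), the instance could latch a "lease lost" flag and check it before every operation:

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch, not Oak API: once the local lease has expired,
 * the instance refuses all further work and never recovers on its own.
 */
final class LeaseGuardSketch {
    private final long leaseTimeMillis;
    private final LongSupplier clock;   // injected so the logic can be tested without sleeping
    private long leaseEndMillis;
    private boolean leaseLost;          // latched: never reset once set

    LeaseGuardSketch(long leaseTimeMillis, LongSupplier clock) {
        this.leaseTimeMillis = leaseTimeMillis;
        this.clock = clock;
        this.leaseEndMillis = clock.getAsLong() + leaseTimeMillis;
    }

    /** Extends the lease; returns false if it had already expired (too late). */
    synchronized boolean renewLease() {
        if (expired()) {
            return false;
        }
        leaseEndMillis = clock.getAsLong() + leaseTimeMillis;
        return true;
    }

    /** To be called before any read or write that assumes cluster membership. */
    synchronized void checkLease() {
        if (expired()) {
            throw new IllegalStateException("lease expired, this instance must stop");
        }
    }

    private boolean expired() {
        if (!leaseLost && clock.getAsLong() >= leaseEndMillis) {
            leaseLost = true;
        }
        return leaseLost;
    }
}
```

The latch matters for the halted-process case: after a long pause the update thread may resume and try to renew, but by then the others may already have started recovery, so a late renewal must not bring the instance back.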

> take appropriate action when lease cannot be renewed (in time)
> --------------------------------------------------------------
>                 Key: OAK-2739
>                 URL: https://issues.apache.org/jira/browse/OAK-2739
>             Project: Jackrabbit Oak
>          Issue Type: Task
>          Components: mongomk
>    Affects Versions: 1.2
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>              Labels: resilience
>             Fix For: 1.3.5
> Currently, in an oak-cluster when (e.g.) one oak-client stops renewing its lease (ClusterNodeInfo.renewLease()),
this will be eventually noticed by the others in the same oak-cluster. Those then mark this
client as {{inactive}} and start recovering and subsequently removing that node from any
further merge etc operation.
> Now, whatever the reason was why that client stopped renewing the lease (could be an
exception, deadlock, whatever) - that client itself still considers itself as {{active}} and
continues to take part in the cluster action.
> This will result in an unbalanced situation where that one client 'sees' everybody as
{{active}} while the others see this one as {{inactive}}.
> If this ClusterNodeInfo state should be something that can be built upon, and to avoid
any inconsistency due to unbalanced handling, the inactive node should probably retire gracefully
- or any other appropriate action should be taken, other than just continuing as today.
> This ticket is to keep track of ideas and actions taken wrt this.
