jackrabbit-oak-issues mailing list archives

From "Marcel Reutegger (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (OAK-4739) lease: immediate renew after long renew call
Date Thu, 01 Sep 2016 10:17:21 GMT

    [ https://issues.apache.org/jira/browse/OAK-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15455001#comment-15455001 ]

Marcel Reutegger commented on OAK-4739:
---------------------------------------

bq. In the current implementation this retry gets never called.

I assume you are referring to the background lease update thread that periodically performs
the lease renew. In your case the lease update thread must not retry the renew operation because
the lease timed out. A cluster node in this state requires recovery, which could have been
initiated already by another cluster node. There is also the discovery mechanism that depends
on the correct handling of lease updates and state transitions.

bq. It's an improvement that the code tries to recover (at least once) from a network issue.

That's a reasonable request, but we need to distinguish cases that can be handled by the lease
renew (i.e. short network issues) from others that cannot be handled on this layer. Longer
network issues (more than the lease timeout) require recovery of the entire DocumentNodeStore,
which currently means manual intervention because the oak-core bundle is stopped (OAK-3397).
Alternatively, a restart of the DocumentNodeStore was proposed (OAK-3250), but that comes with
open questions such as how to handle open sessions with listeners.
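
To illustrate the distinction, here is a minimal sketch of how a renew outcome could be
classified. It is not the actual ClusterNodeInfo code: the class LeaseRenewPolicy, the
Outcome enum and the classify method are hypothetical names, and the rule that exactly
reaching the lease duration still counts as "in time" is an assumption.

```java
import java.time.Duration;

/** Hypothetical sketch; these names are illustrative and not part of Oak's API. */
final class LeaseRenewPolicy {

    enum Outcome { RENEWED, RETRY, RECOVERY_REQUIRED }

    private final Duration leaseDuration;

    LeaseRenewPolicy(Duration leaseDuration) {
        this.leaseDuration = leaseDuration;
    }

    /**
     * @param sinceLastRenew time elapsed since the last successful lease renew
     * @param renewSucceeded whether the most recent renew call succeeded
     */
    Outcome classify(Duration sinceLastRenew, boolean renewSucceeded) {
        // Assumption: once the lease has timed out, another cluster node may
        // already have started recovery, so a retry is no longer safe, even
        // if the late renew call itself succeeded.
        if (sinceLastRenew.compareTo(leaseDuration) > 0) {
            return Outcome.RECOVERY_REQUIRED;
        }
        // Short network issue: the lease is still valid, so a retry is harmless.
        return renewSucceeded ? Outcome.RENEWED : Outcome.RETRY;
    }
}
```

With a 120 second lease, a failed renew after 30 seconds would be classified as RETRY,
while any renew that returns after more than 120 seconds would be RECOVERY_REQUIRED.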

> lease: immediate renew after long renew call
> --------------------------------------------
>
>                 Key: OAK-4739
>                 URL: https://issues.apache.org/jira/browse/OAK-4739
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: documentmk
>    Affects Versions: 1.5.8
>            Reporter: Martin Böttcher
>
> A single temporary network issue can shut down the DocumentStore. We observed the following
> situation:
> # org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo.renewLease was called (this
> is done regularly and is completely normal)
> # the network had a temporary issue (of whatever kind)
> # the database call terminated only after a long time (the default db maxWaitTime is 120
> seconds)
> # org.apache.jackrabbit.oak.plugins.document.ClusterNodeInfo.renewLease decides that
> the current lease is too old (>120 seconds, which is the default for the
> oak.documentMK.leaseDurationSeconds property), sets a leaseCheckFailed variable and throws
> an Exception
> # because leaseCheckFailed is set, all following tries (if any) will immediately throw
> an Exception, too.
> I'd recommend making the ClusterNodeInfo code more robust so that at least one retry
> will be made.
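
The sequence reported above can be modelled with a small sketch. It is a simplified,
hypothetical model and not the real ClusterNodeInfo implementation: the class LeaseModel,
its fields and the use of IllegalStateException are assumptions made for illustration.
Only the sticky leaseCheckFailed flag is taken from the issue description.

```java
/** Hypothetical model of the reported sequence; not Oak's actual code. */
final class LeaseModel {

    private final long leaseDurationMillis;
    private long leaseEndMillis;
    private boolean leaseCheckFailed;

    LeaseModel(long nowMillis, long leaseDurationMillis) {
        this.leaseDurationMillis = leaseDurationMillis;
        this.leaseEndMillis = nowMillis + leaseDurationMillis;
    }

    /** Models a renew whose database call returns at nowMillis. */
    void renew(long nowMillis) {
        if (leaseCheckFailed) {
            // Step 5: once the flag is set, every later attempt fails immediately.
            throw new IllegalStateException("lease check failed earlier");
        }
        if (nowMillis > leaseEndMillis) {
            // Step 4: the call returned after the lease had already expired.
            leaseCheckFailed = true;
            throw new IllegalStateException("lease expired before renew completed");
        }
        leaseEndMillis = nowMillis + leaseDurationMillis;
    }

    boolean hasLeaseCheckFailed() {
        return leaseCheckFailed;
    }
}
```

In this model, a renew that returns more than leaseDurationMillis after the previous
successful renew sets the flag, and no later call can clear it.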



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
