qpid-dev mailing list archives

From Rajith Attapattu <rajit...@gmail.com>
Subject Re: Failover
Date Thu, 15 Sep 2011 15:13:17 GMT
The issues highlighted by Robbie are pretty much the problem areas
that I have identified as well (along with a few more).
All in all, the failover code is the Achilles' heel of the JMS client,
and most of the stability issues, deadlocks and race conditions are
around this area.

I've been collecting some notes for a while, so let me add to Robbie's
list. I've skipped the points Robbie already discussed.
Here are a few more areas that I think we need to consider when
discussing this (I've noted the relevant JIRAs along with the points):

1. Fundamental flaws in the current failover design.
    1.1 IMO the level of abstraction seems wrong. The way failover
works at the JMS layer is sometimes at odds with the way failover
works at the version-specific AMQP layer (partly explained by #3).
Sometimes the same decision is made at different levels, leading to
deadlocks. Sometimes there is a lack of coordination between the JMS
layer and the version-specific layer, resulting in correctness issues.
Ex. the JMS layer proceeding without waiting for the lower layer to
complete failover, or vice versa.

    1.2 Little or no thought given to exception handling (see #2).

    1.3 Certain methods/strategies were not implemented with failover
in mind - retrofitting has caused undesirable behaviour (see #4 &
#5).

    1.4 Use of global locks that leak across class boundaries without
a clear strategy is a recipe for disaster. There is no way to really
ensure ordering guarantees to prevent deadlocks. Ex. the notorious
failover_mutex :)

    1.5 No clear statement as to what our failover guarantees are
(w.r.t each AMQP version). This is causing upgrade issues and
general confusion for Qpid users.
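
To illustrate the ordering problem in 1.4, here is a minimal sketch of
one common way to make lock ordering checkable: give every lock a rank
and refuse any acquisition that goes against the ranking. This is not
how the client works today; RankedLock, CONNECTION and SESSION are
hypothetical names used only for this example, and the sketch does not
support re-entrant acquisition.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.ReentrantLock;

final class LockOrderingSketch {
    /** Hypothetical lock with a fixed rank; ranks must be taken in increasing order. */
    static final class RankedLock {
        // Locks currently held by this thread, most recently acquired first.
        private static final ThreadLocal<Deque<RankedLock>> HELD =
                ThreadLocal.withInitial(ArrayDeque::new);

        final String name;
        final int rank;
        private final ReentrantLock lock = new ReentrantLock();

        RankedLock(String name, int rank) {
            this.name = name;
            this.rank = rank;
        }

        void lock() {
            RankedLock top = HELD.get().peek();
            // Reject the request before blocking if it breaks the global order.
            if (top != null && top.rank >= rank) {
                throw new IllegalStateException(
                        "lock order violation: " + name + " requested while holding " + top.name);
            }
            lock.lock();
            HELD.get().push(this);
        }

        void unlock() {
            Deque<RankedLock> held = HELD.get();
            if (held.peek() != this) {
                throw new IllegalStateException("unlock out of order: " + name);
            }
            held.pop();
            lock.unlock();
        }
    }

    // Assumed ranking for illustration only: connection-level before session-level.
    static final RankedLock CONNECTION = new RankedLock("connection", 1);
    static final RankedLock SESSION = new RankedLock("session", 2);

    /** Returns true if taking 'second' while holding 'first' breaks the ranking. */
    static boolean violatesOrder(RankedLock first, RankedLock second) {
        first.lock();
        try {
            second.lock();
            second.unlock();
            return false;
        } catch (IllegalStateException e) {
            return true;
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println(violatesOrder(CONNECTION, SESSION)); // false
        System.out.println(violatesOrder(SESSION, CONNECTION)); // true
    }
}
```

The point of failing fast is that an ordering mistake shows up as an
exception in testing rather than as an intermittent deadlock in
production.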

2. Exception handling
    There are two main areas of concern here, and they definitely
affect the failover experience for end users.

     2.1 Exceptions being reported in two directions.
           Currently exceptions can be thrown when calling synchronous
operations, but the same exception is also reported via the
exception listener.
           In most cases this will result in the connection being
closed needlessly, and in some cases this results in a deadlock. Ex.
QPID-3259

     2.2 Exceptions don't provide enough information for application
developers.
           Currently there is little or no information in the
exceptions for app developers to figure out if they should be
recreating the session or the connection.
           For example, if an ACL exception or a resource limit
exceeded exception is thrown by the broker, it's not easy to
distinguish between that and a connection closed exception when
sending a method.
           The problem is exacerbated by the problem identified in 2.1
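
As a rough sketch of what 2.1 and 2.2 are asking for: each failure is
delivered through exactly one channel, and the exception itself says
whether the session or the connection needs recreating. All names here
(Scope, ScopedException, FailureReporter) are hypothetical and not part
of the client or of JMS; a real version would presumably extend
javax.jms.JMSException, and treating an ACL failure as session-scoped
is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class ExceptionReportingSketch {
    /** Hypothetical: which object the application must recreate to recover. */
    enum Scope { SESSION, CONNECTION }

    /** Hypothetical exception type that carries its scope. */
    static final class ScopedException extends Exception {
        final Scope scope;

        ScopedException(String message, Scope scope) {
            super(message);
            this.scope = scope;
        }
    }

    /** Delivers each failure through exactly one channel. */
    static final class FailureReporter {
        private final Consumer<ScopedException> listener;

        FailureReporter(Consumer<ScopedException> listener) {
            this.listener = listener;
        }

        void report(ScopedException e, boolean callerWaiting) throws ScopedException {
            if (callerWaiting) {
                throw e;        // the synchronous caller sees it; the listener is not notified
            }
            listener.accept(e); // nobody is waiting, so notify the listener instead
        }
    }

    /** What an application could decide from the scope alone. */
    static String recoveryAction(ScopedException e) {
        return e.scope == Scope.SESSION ? "recreate session" : "recreate connection";
    }

    public static void main(String[] args) throws Exception {
        List<String> notified = new ArrayList<>();
        FailureReporter reporter = new FailureReporter(e -> notified.add(e.getMessage()));

        try {
            reporter.report(new ScopedException("ACL denied", Scope.SESSION), true);
        } catch (ScopedException e) {
            System.out.println("thrown to caller: " + recoveryAction(e));
        }

        reporter.report(new ScopedException("connection closed", Scope.CONNECTION), false);
        System.out.println("listener notified of: " + notified);
    }
}
```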

3. The differences between 0-10 and pre 0-10 versions.
    This IMO is another source of bugs, at least in the 0-10 path.
The pre 0-10 versions do not have a "sync()" operation, whereas
the 0-10 version does. The way failover is designed/coded, any sync
operation during failover will cause deadlocks. Ex. QPID-2808 was
causing QPID-2809. In the end we had to do a workaround.

4. Recover() implementation
    I believe the implementation of recovery is broken w.r.t failover
(and maybe in general). (needs a JIRA)

5. Client ACK is broken w.r.t failover. There are 3 use cases to
consider and I don't think we handle even one correctly. QPID-3462

Regards,

Rajith


On Thu, Sep 15, 2011 at 11:03 AM, Robbie Gemmell
<robbie.gemmell@gmail.com> wrote:
> Hi all,
>
> There are currently a number of issues with the Failover behaviour of
> the client which require some attention. It would be good to discuss
> them and work towards having the Failover implementation more fully
> meet user expectations. I am going to be spending some time working in
> this area along with Alex Rudyy in the weeks ahead.
>
> Some of the issues to consider:
>
> 1. Non-blocking approach currently leads to correctness issues.
>
> The 0-10 codepath uses a non-blocking Failover model which currently
> fails to protect the client from performing certain operations during
> Failover, and this can lead to unexpected behaviour. For example,
> closing QueueBrowsers during Failover has been observed to cause
> issues because it is possible for the client to send the old
> subscription's destination in a cancel command to the new broker, as
> the close and Failover are allowed to progress concurrently. Failover
> had started but not yet completed the resubscription operations,
> meaning the new broker didn't yet know about the destination and so had
> to respond by closing the Session with a NOT_FOUND execution exception.
>
> 2. Transacted sessions
>
> With the 0-10 client, any transacted Sessions in use are currently
> closed upon Failover occurring, and upon next use of the Session the
> client application then gets an exception indicating the Session is
> closed. This seems to give little benefit to users from having
> Failover while using transactions, which to me actually seems like the
> most obvious use case. A further issue with this process is that it is
> completely different from the approach taken by the 0-8/9 codepath,
> making compatibility during upgrade an issue.
>
> Upon Failover occurring, recreating the Session and providing a means
> to indicate the previous transaction was not successful seems like a
> more user friendly thing to do as it is more in line with user
> expectations of how transactions work, and this is exactly what the
> 0-8/9 codepath does. JMS provides a TransactionRolledBackException
> which can be thrown upon commit(), and this is used in the 0-8/0-9
> client codepath when Failover occurs to indicate the transaction is no
> longer valid and was rolled back, allowing the client to simply replay
> their transaction.
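
The replay pattern described above could look roughly like the sketch
below from the application's side. To keep it self-contained it does
not use the JMS API: TransactionRolledBackException here is a local
stand-in for javax.jms.TransactionRolledBackException, and TxSession,
TxWork, FlakySession and runWithReplay are hypothetical names for
illustration only.

```java
import java.util.ArrayList;
import java.util.List;

final class TransactionReplaySketch {
    /** Local stand-in for javax.jms.TransactionRolledBackException. */
    static final class TransactionRolledBackException extends Exception {
        TransactionRolledBackException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for the transactional part of a JMS Session. */
    interface TxSession {
        void send(String body);
        void commit() throws TransactionRolledBackException;
    }

    /** The unit of work that is replayed if the transaction is rolled back. */
    interface TxWork {
        void run(TxSession session);
    }

    /** Runs the work and commits, replaying on rollback; returns the attempts used. */
    static int runWithReplay(TxSession session, TxWork work, int maxAttempts)
            throws TransactionRolledBackException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        TransactionRolledBackException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                work.run(session);
                session.commit();
                return attempt;
            } catch (TransactionRolledBackException e) {
                last = e; // transaction was rolled back, e.g. by failover; replay it
            }
        }
        throw last;
    }

    /** Fake session whose first 'failures' commits are rolled back. */
    static final class FlakySession implements TxSession {
        final List<String> committed = new ArrayList<>();
        private final List<String> pending = new ArrayList<>();
        private int failuresLeft;

        FlakySession(int failures) {
            this.failuresLeft = failures;
        }

        public void send(String body) {
            pending.add(body);
        }

        public void commit() throws TransactionRolledBackException {
            if (failuresLeft > 0) {
                failuresLeft--;
                pending.clear(); // uncommitted work is discarded on rollback
                throw new TransactionRolledBackException("failover occurred");
            }
            committed.addAll(pending);
            pending.clear();
        }
    }

    public static void main(String[] args) throws Exception {
        FlakySession session = new FlakySession(1);
        int attempts = runWithReplay(session, s -> { s.send("a"); s.send("b"); }, 3);
        System.out.println(attempts + " attempts, committed " + session.committed);
    }
}
```

This only works if the unit of work is safe to run again from the
start, which is the assumption behind "simply replay".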
>
> 3. [Dead]locks
>
> The current client implementation is rather heavy on locks, and the
> various routes for acquiring them have created situations which can
> result in deadlock. It would be worth investigating a reduction in the
> number of locks required for the client, both to make the
> implementation clearer and to reduce or even remove the possibility of
> deadlocks. (E.g., the recent issues around actually closing
> subscriptions before closing the sessions.)
>
> 4. Acks
>
> Currently there are a number of issues with our acknowledgement
> generation process that mean we are fairly non-compliant with the JMS
> specification, and the reliability guarantees people are expecting may
> not be met as a result.
>
>
> Robbie
>
> ---------------------------------------------------------------------
> Apache Qpid - AMQP Messaging Implementation
> Project:      http://qpid.apache.org
> Use/Interact: mailto:dev-subscribe@qpid.apache.org
>
>
