qpid-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "rajith attapattu" <rajit...@gmail.com>
Subject Re: Review Request: Address deadlock btw _messageDeliveryLock and _failoverMutex
Date Tue, 11 Jun 2013 23:10:46 GMT


> On May 13, 2013, 3:22 p.m., Robbie Gemmell wrote:
> > http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/client/src/main/java/org/apache/qpid/client/BasicMessageConsumer.java,
lines 744-745
> > <https://reviews.apache.org/r/10738/diff/2/?file=288993#file288993line744>
> >
> >     What impact does dropping the message on the floor here have on the client?
> >     
> >     Earlier points in the call hierarchy seem to make effort to do other things
with the message when detecting session/consumer close, so is there any impact from not doing
so? E.g a message getting stuck acquired for a now-closed consumer?
> >     
> >     Does any particular attention need paid to the overridden 0-10 specific version
of this method?
> 
> rajith attapattu wrote:
>     The message will be dropped if a consumer (or session is closed or closing).
>     When a consumer is closed, any messages acquired but not acknowledged should be be
made available to another consumer by the broker.
>     These messages will be marked redelivered.
>     Is this the same situation for 0-8/0-9 ?
>     
>     The situation is the same as a consumer closing with a bunch of unacked messages
when in CLIENT_ACK mode.
>     Alternatively we could reject the message (release in 0-10 terms). But I don't think
this is required, given that we will be closing the consumer anyways.
>     
>     
>     >Earlier points in the call hierarchy seem to make effort to do other things with
the message when detecting session/consumer close
>     By "other things" are you referring to a reject? In 0-10 AFIK you don't need to do
it. 
>     The situation is the same as a consumer closing with a bunch of unacked messages
when in CLIENT_ACK mode.
>     
>     >Does any particular attention need paid to the overridden 0-10 specific version
of this method?
>     IMO adding "if (!(isClosed() || isClosing()))" is required to prevent the 0-10 specific
method from issuing credit (and receiving more messages, that will eventually get released).
>     There is a chance where the consumer could be marked closed, but not yet reached
the point of sending a cancel, by the time messageFlow() is called.
>     The above check should prevent it.
> 
> Robbie Gemmell wrote:
>     "Did you mean race condition (instead of deadlock)?"
>     
>     I really meant deadlock. One of the uses of _messageDeliveryLock being removed was
added as part of the fix for a deadlock (https://issues.apache.org/jira/browse/QPID-3911)
and so I wonder what effect removing it again will have. It may be the other changes mean
it isn't a problem, but I think it needs to be closely considered. 
>     
>     "The message will be dropped if a consumer (or session is closed or closing).
>     When a consumer is closed, any messages acquired but not acknowledged should be be
made available to another consumer by the broker.
>     These messages will be marked redelivered.
>     Is this the same situation for 0-8/0-9 ?"
>     
>     That isn't actually the case, transferred messages are the responsibility of the
session and remain acquired after consumer close until such point as the session explicitly
does something with them. I believe this is true of 0-8/9/91 as well, and probably explains
the origin of the reject/release code I referred to earlier in the call tree than the changes
would now allow the message to be silently dropped.
>     
>     From the 0-10 spec: "Canceling a subscription MUST NOT affect pending transfers.
A transfer made prior to canceling transfers to the destination MUST be able to be accepted,
released, acquired, or rejected after the subscription is canceled."
> 
> rajith attapattu wrote:
>     Indeed, Gordon mentioned that to me and the 2nd patch (the one before I did today)
takes care of rejecting messages from the consumer. We don't need to do the same when the
session is closing, as when the session ends, the any unacked messages are put back on the
queue.
> 
> rajith attapattu wrote:
>     As for QPID-3911,
>     There is a deadlock, albeit a bit different (involving the same locks) from QPID-3911,
that do happen in similar circumstances.
>     However this deadlock appears to manifest with or without this patch, which leads
me to believe that _messageDeliveryLock is not the right solution for QPID-3911.
>     Sadly the solution for QPID-3911 made it worse as there are at least two distinct
cases of deadlocks involving _messageDeliveryLock.
>     1. Btw _lock and _messageDeliveryLock
>     2. Btw _messageDeliveryLock and _failoverMutex.
>     
>     We definitely need to find a solution for the deadlocks (at least 3 cases) btw failoverMutex
and _lock (in AMQSession), which seems to be the root of all evil :)
>     We might have to drop one lock (most likely _lock) and see if we can provide an alternative
strategy to guarantee correct behaviour.
>
> 
> rajith attapattu wrote:
>     Sorry I meant dopping _failoverMutex (not _lock in AMQSession).
>     It might also be an opportunity to fix our less than stellar failover.
> 
> Robbie Gemmell wrote:
>     The deadlock itself wasn't really the main issue in QPID-3911, it was just a very
obvious symptom that presented after the underlying problem had occurred, which was that the
client sent illegal commands to the broker due to allowing things to occur at the same time
(/in the wrong order) that it shouldn't had, namely suspending the session(/consumers) during
session rollback inside onMessage() at the same time as a consumer close occurring on the
session in another thread. The deliveryLock presented a route to stop that by preventing the
possibility that the dispatcher would be inside user code at the time and thus make it unable
to cause the problem.
>     
>     As you say, while the changes fixed that problem it seems they introduced another.
It so far appears to me like that is what using the current changes here would be doing, by
returning the client to the state where it could be delivering messages on the session [for
another consumer] that could allow rollback while a consumer close is in progress, but with
the distinction that we would know in advance this time that there could be a problem with
the change.
>     
>     Having a quick look through the client code, it seems like an avenue you could pursue
to fix the issue (for 0-10 at least) might be the AMQSession suspensionLock. It is only used
inside AMQSession currently, in a way that looks like its acquisition wont be followed by
executing user code or acquiring any more locks, and so at first glance looks like it could
be safe to acquire during the consumer close as well in order to prevent the concurrent suspend+close
that actually caused the problem for 0-10 in QPID-3911. That would only be part of the fix
though as it would probably allow a consumer close to block the session rollback only for
that to then proceed afterwards and cause the same illegal commands to be sent. To prevent
this it seems like ensuring we don't operate on closing/closed consumers in AMQSession_0_10#sendSuspendChannel(..)
would stop the client issuing the stop/flow commands for the now-closing/closed consumers.
I haven't spent enough time looking at this to know if there are any side effects from doing
that though, or if that would fix the related issues seen with 0-8/9/91 which were slightly
different and didn't actually involve a deadlock but rather a command timeout, so such things
would need investigated if you pursued the approach.

What you describe could happen and obtaining the suspension lock could be a potential solution.
The same happens when the broker closes the session while rollback is in progress.
In both situations a deadlock happens.

Another potential solution is (which I think we should do anyways) is to check if the session
is marked closed before we send anything to the broker.
As soon as we enter session close() we should mark the session as being closed.
There are a lot of methods where we just send commands on the wire without checking this and
it is causing issues.

I will investigate both.


- rajith


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10738/#review20483
-----------------------------------------------------------


On May 21, 2013, 2:48 p.m., rajith attapattu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10738/
> -----------------------------------------------------------
> 
> (Updated May 21, 2013, 2:48 p.m.)
> 
> 
> Review request for qpid, Robbie Gemmell, Weston Price, and Rob Godfrey.
> 
> 
> Description
> -------
> 
> There are at least 3 cases where the deadlock btw _messageDeliveryLock and _failoverMutex
surfaces. Among them sending a message inside onMessage() and the session being closed due
to an error (causing the deadlock) seems to come up a lot in production environments. There
is also a deadlock btw _messageDeliveryLock and _lock (AMQSession.java) which happens less
frequently.
> The messageDeliveryLock is used to ensure that we don't close the session in the middle
of a message delivery. In order to do this we hold the lock across onMessage().
> This causes several issues in addition to the potential to deadlock. If an onMessage
call takes longer/wedged then you cannot close the session or failover will not happen until
it returns as the same thread is holding the failoverMutex. 
> 
> Based on an idea from Rafi, I have come up with a solution to get rid of _messageDeliveryLock
and instead use an alternative strategy to achieve similar functionality.
> In order to ensure that close() doesn't proceed until the message deliveries currently
in progress completes, an atomic counter is used to keep track of message deliveries in progress.
> The close() will wait until the count falls to zero before proceeding. No new deliveries
will be initiated bcos the close method will mark the session as closed.
> The wait has a timeout to ensure that a longer running or wedged onMessage() will not
hold up session close.
> There is a slim chance that before a session being marked as closed a message delivery
could be initiated, but not yet gotten to the point of updating the counter, hence waitForMsgDeliveryToFinish()
will see it as zero and proceed with close. But in comparison to the issues with _messageDeliveryLock,
I believe it's acceptable.
> 
> There is an issue if MessageConsumer close is called outside of Session close. This can
be solved in a similar manner. I will wait until the current review is complete and then post
the solution for the MessageConsumer close.
> I will commit them both together.
> 
> 
> This addresses bug QPID-4574.
>     https://issues.apache.org/jira/browse/QPID-4574
> 
> 
> Diffs
> -----
> 
>   http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/client/src/main/java/org/apache/qpid/client/AMQConnection.java
1484812 
>   http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/client/src/main/java/org/apache/qpid/client/AMQSession.java
1484812 
>   http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/client/src/main/java/org/apache/qpid/client/BasicMessageConsumer.java
1484812 
>   http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/client/src/main/java/org/apache/qpid/client/BasicMessageConsumer_0_10.java
1484812 
>   http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/client/src/main/java/org/apache/qpid/client/util/ConditionManager.java
PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/10738/diff/
> 
> 
> Testing
> -------
> 
> Java test suite, tests from customers and QE around the deadlock situation.
> 
> 
> Thanks,
> 
> rajith attapattu
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message