qpid-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rafael Schloming <rafa...@redhat.com>
Subject Re: MessageListenerMultiConsumerTest deadlock when run in 0-10 code path using ant
Date Mon, 03 Mar 2008 12:45:20 GMT
The attached patch appears to fix the deadlock by merging _lock and 
_messageDeliveryLock into one. Unfortunately this needs to be 
conditional on 0-8 vs 0-10 codepath because merging the locks for 0-8 
causes the same deadlock that it prevents on 0-10 since on 0-8 
rejectPending is executed in another thread, and the closing thread 
blocks until that is complete.

I'm not entirely happy with this as a solution since I'd like to 
understand if the potential for deadlock is still there in the 0-8 code.


Rafael Schloming wrote:
> Arnaud Simon wrote:
>> On Mon, 2008-03-03 at 10:22 +0000, Aidan Skinner wrote:
>>> On Fri, Feb 29, 2008 at 8:23 PM, Rafael Schloming 
>>> <rafaels@redhat.com> wrote:
>>>> The deadlock is between the session's _messageDeliveryLock and the
>>>>  Dispatcher's _lock. The dispatcher thread's main loop attempts to
>>>>  acquire _lock first and then _messageDeliveryLock. The main thread
>>>>  acquires _messageDeliveryLock first thing on close(...) and then
>>>>  subsequently many levels down the stack attempts to acquire _lock in
>>>>  order to call Dispatcher.rejectPending(...).
>>> We found (and fixed) several similar problems on M2, it's probably
>>> worth having a look at that code to make sure there aren't any cases
>>> being missed.
>>> - Aidan
>> Rafael is absolutely right, Session.close is holding a lock on
>> _messageDeliveryLock only to prevent messages being delivered to
>> consumers when the session is closing (note that this lock should also
>> be held when consumer are closing, see Qpid-812). There is however no
>> reason for the dispatcher thread to keep a lock on _lock when
>> dispatching a message. _lock is used for stopping/starting the
>> dispatcher thread and _messageDeliveryLock for preventing message
>> delivery. So, we should change the run method of dispatcher for
>> releasing _lock and rollback and rejectPending for synchronizing on
>> _messageDeliveryLock. 
> This was my first thought too, however I think there actually is a 
> reason. Connection.stop() is supposed to block until all message 
> listeners are finished with whatever they're currently work on. Holding 
> _lock while executing the message listener does accomplish this. If we 
> were to release the lock earlier then stop() could return while the 
> message listener is still in progress.
>> Note that this deadlock situation would not happen with 0.8 as an
>> external thread is calling rejectPending but (if the dispatcher is
>> trying to dispatch a message) this thread will need to wait until close
>> finishes and will create a new dispatcher thread that will not be
>> closed.
> I thought this too, however on closer examination the closing thread 
> does a syncWrite, which means that thread will block until the 
> rejectPending is complete, so it should be the same as executing in the 
> same thread.
> I suspect the potential for the deadlock to occur is actually still 
> there on the 0-8 codepath, but it is not occuring for some reason. I 
> don't know if this is timing related or some other reason. The stack 
> trace suggests that the dispatcher thread is running as we attempt to 
> cancel the consumer. Stopping the thread should prevent the deadlock, I 
> don't know if the 0-8 code does that or not.
>> The way the dispatcher thread is closed should also be changed. The
>> close operation currently interrupts the thread this means that the
>> dispatcher does not have a chance to acquire a lock on
>> _messageDeliveryLock for correctly releasing the message it may be
>> waiting to dispatch. 
> Agreed, the way it checks a private _closed variable rather than 
> checking the session's _closed variable looks a bit dodgy also.
> --Rafael

View raw message