qpid-dev mailing list archives

From Rafael Schloming <rafa...@redhat.com>
Subject Re: MessageListenerMultiConsumerTest deadlock when run in 0-10 code path using ant
Date Mon, 03 Mar 2008 12:33:39 GMT
Arnaud Simon wrote:
> On Mon, 2008-03-03 at 10:22 +0000, Aidan Skinner wrote:
>> On Fri, Feb 29, 2008 at 8:23 PM, Rafael Schloming <rafaels@redhat.com> wrote:
>>> The deadlock is between the session's _messageDeliveryLock and the
>>>  Dispatcher's _lock. The dispatcher thread's main loop attempts to
>>>  acquire _lock first and then _messageDeliveryLock. The main thread
>>>  acquires _messageDeliveryLock first thing on close(...) and then
>>>  subsequently many levels down the stack attempts to acquire _lock in
>>>  order to call Dispatcher.rejectPending(...).
>> We found (and fixed) several similar problems on M2, it's probably
>> worth having a look at that code to make sure there aren't any cases
>> being missed.
>> - Aidan
> Rafael is absolutely right, Session.close is holding a lock on
> _messageDeliveryLock only to prevent messages being delivered to
> consumers while the session is closing (note that this lock should also
> be held when consumers are closing, see QPID-812). There is, however, no
> reason for the dispatcher thread to keep holding _lock while
> dispatching a message. _lock is used for stopping/starting the
> dispatcher thread, and _messageDeliveryLock for preventing message
> delivery. So, we should change the Dispatcher's run method to release
> _lock before dispatching, and change rollback and rejectPending to
> synchronize on _messageDeliveryLock.
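For anyone following along, the inconsistent lock ordering described above can be sketched in a minimal, self-contained Java example. The lock names are hypothetical stand-ins for the Dispatcher's _lock and the session's _messageDeliveryLock, not the actual Qpid classes; tryLock with a timeout is used so the demo reports the would-be deadlock instead of hanging:

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderDemo {
    // Stand-ins for the Dispatcher's _lock and the session's _messageDeliveryLock.
    static final ReentrantLock dispatcherLock = new ReentrantLock();
    static final ReentrantLock messageDeliveryLock = new ReentrantLock();

    public static void main(String[] args) throws Exception {
        CyclicBarrier barrier = new CyclicBarrier(2);

        // Dispatcher thread: acquires _lock first, then _messageDeliveryLock.
        Thread dispatcher = new Thread(() -> run(dispatcherLock, messageDeliveryLock,
                                                 barrier, "dispatcher"));
        // Closing thread: acquires _messageDeliveryLock first, then _lock.
        Thread closer = new Thread(() -> run(messageDeliveryLock, dispatcherLock,
                                             barrier, "closer"));
        dispatcher.start();
        closer.start();
        dispatcher.join();
        closer.join();
    }

    static void run(ReentrantLock first, ReentrantLock second,
                    CyclicBarrier barrier, String name) {
        first.lock();
        try {
            await(barrier);                 // both threads now hold their first lock
            boolean got = tryFor(second);   // each now needs the other's lock
            if (got) {
                second.unlock();
            }
            System.out.println(name + " acquired second lock: " + got);
            await(barrier);                 // hold first lock until both attempts end
        } finally {
            first.unlock();
        }
    }

    static boolean tryFor(ReentrantLock lock) {
        try {
            // A real deadlock would block forever here; the timeout makes it visible.
            return lock.tryLock(200, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    static void await(CyclicBarrier barrier) {
        try {
            barrier.await();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Both threads fail to take their second lock, which is exactly the cycle in the stack trace.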

This was my first thought too, however I think there actually is a 
reason. Connection.stop() is supposed to block until all message 
listeners are finished with whatever they're currently working on. 
Holding _lock while executing the message listener does accomplish this. 
If we were to release the lock earlier, then stop() could return while a 
message listener is still in progress.
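To illustrate that point, here is a minimal sketch (all names hypothetical, not the actual Qpid code) of why holding the lock across the listener call makes stop() block until the listener returns:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class StopBlocksDemo {
    private final Object lock = new Object();           // stand-in for the Dispatcher's _lock
    private final AtomicBoolean listenerDone = new AtomicBoolean(false);
    private final CountDownLatch listenerStarted = new CountDownLatch(1);

    void deliverOne() {
        synchronized (lock) {                           // held across the listener call
            listenerStarted.countDown();
            try {
                Thread.sleep(100);                      // simulate listener work
            } catch (InterruptedException ignored) {
            }
            listenerDone.set(true);                     // listener finishes inside the lock
        }
    }

    void stop() {
        synchronized (lock) {
            // Acquiring the lock forces stop() to wait for any in-flight listener.
        }
    }

    public static void main(String[] args) throws Exception {
        StopBlocksDemo demo = new StopBlocksDemo();
        Thread dispatcher = new Thread(demo::deliverOne);
        dispatcher.start();
        demo.listenerStarted.await();                   // ensure delivery is in progress
        demo.stop();                                    // must wait for the listener
        System.out.println("listener done when stop() returned: " + demo.listenerDone.get());
        dispatcher.join();
    }
}
```

If deliverOne released the lock before invoking the listener, stop() could return with the listener still running, which is the behavior Connection.stop() must rule out.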

> Note that this deadlock situation would not happen with 0-8, as an
> external thread calls rejectPending; but (if the dispatcher is trying
> to dispatch a message) that thread will need to wait until close
> finishes, and it will create a new dispatcher thread that will not be
> closed.

I thought this too, however on closer examination the closing thread 
does a syncWrite, which means that thread will block until the 
rejectPending is complete, so it should be equivalent to executing in 
the same thread.

I suspect the potential for the deadlock is actually still there on the 
0-8 codepath, but it is not occurring for some reason; I don't know 
whether that is timing related or something else. The stack trace 
suggests that the dispatcher thread is running as we attempt to cancel 
the consumer. Stopping the thread first should prevent the deadlock; I 
don't know whether the 0-8 code does that or not.

> The way the dispatcher thread is closed should also be changed. The
> close operation currently interrupts the thread; this means that the
> dispatcher does not get a chance to acquire _messageDeliveryLock and
> correctly release the message it may be waiting to dispatch.

Agreed. The way it checks a private _closed variable rather than the 
session's _closed variable looks a bit dodgy also.
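A cooperative shutdown along the lines Arnaud describes might look like the following minimal sketch. All names are hypothetical stand-ins, not the actual Qpid code; the point is that setting a shared flag and joining, instead of interrupting, lets the dispatcher reach its cleanup under the delivery lock:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class CooperativeCloseDemo {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Object messageDeliveryLock = new Object();
    private volatile boolean closed;               // shared flag, not a private copy
    private volatile boolean cleanShutdown;

    private final Thread dispatcher = new Thread(() -> {
        try {
            while (!closed) {
                // Poll with a timeout so the closed flag is re-checked regularly,
                // rather than relying on Thread.interrupt() to break out.
                String msg = queue.poll(20, TimeUnit.MILLISECONDS);
                if (msg != null) {
                    synchronized (messageDeliveryLock) {
                        // deliver msg to the listener ...
                    }
                }
            }
            // Shutdown path: take the delivery lock and release anything pending.
            synchronized (messageDeliveryLock) {
                queue.clear();                      // stand-in for releasing pending messages
                cleanShutdown = true;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();     // if interrupted, cleanup is skipped
        }
    });

    void close() throws InterruptedException {
        closed = true;                              // cooperative: no interrupt()
        dispatcher.join();
    }

    public static void main(String[] args) throws Exception {
        CooperativeCloseDemo demo = new CooperativeCloseDemo();
        demo.dispatcher.start();
        demo.queue.put("m1");
        Thread.sleep(60);                           // let m1 be delivered
        demo.close();
        System.out.println("clean shutdown: " + demo.cleanShutdown);
    }
}
```

With an interrupt-based close, the InterruptedException path skips the cleanup block, which is exactly the problem with the message left undelivered.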
