qpid-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Radu Dan (Created) (JIRA)" <j...@apache.org>
Subject [jira] [Created] (QPID-3860) Python client enters busy loop on heartbeat timeout
Date Mon, 20 Feb 2012 16:59:34 GMT
Python client enters busy loop on heartbeat timeout
---------------------------------------------------

                 Key: QPID-3860
                 URL: https://issues.apache.org/jira/browse/QPID-3860
             Project: Qpid
          Issue Type: Bug
          Components: Python Client
    Affects Versions: 0.14
         Environment: Client: Linux radudan.activesoft.ro 3.2.3-2.fc16.x86_64 #1 SMP Fri Feb
3 20:08:08 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
Server: Linux ophelia.activesoft.ro 2.6.32-220.2.1.el6.x86_64 #1 SMP Fri Dec 23 02:21:33 CST
2011 x86_64 x86_64 x86_64 GNU/Linux

            Reporter: Radu Dan
            Priority: Critical


Let's asume the following code, that's linked against and connected to qpid 0.14:

#!/bin/env python
import sys, time, logging
from qpid.messaging import *
logging.basicConfig(level=logging.INFO)

connection = Connection(
        host="ophelia.activesoft.ro",
        port=5672,
        heartbeat=1,
        reconnect=True,
        reconnect_interval_min=1,
        reconnect_interval_max=1,
        reconnect_limit=5,
        reconnect_timeout=5
)

connection.open()
session = connection.session()
sender = session.sender("test; {create: always}")

id = 0
while True:
        try:
                sender.send({
                        "hello": "world"
                })
                logging.info("Sent %d" % id)
                id += 1
        except KeyboardInterrupt:
                break
        except:
                logging.info("Error")
        finally:
                time.sleep(1)

The script will try to publish a message on the "test" queue to a broker, using a heartbeat
value of 1s, and will reconnect at most 5 times with a timeout of 1s before giving up and
reporting the message as failed. This works as expected when gracefully shutting the broker
down (service qpidd stop):

<<daemon is running>>
INFO:root:Sent 0
INFO:root:Sent 1
INFO:root:Sent 2
INFO:root:Sent 3
INFO:root:Sent 4
<<daemon has been stopped>>
WARNING:qpid.messaging:recoverable error[attempt 0]: connection aborted
WARNING:qpid.messaging:trying: ophelia.activesoft.ro:5672
<<daemon is still stopped, yet a reconnect apparently succeeds>>
WARNING:qpid.messaging:reconnect succeeded: ophelia.activesoft.ro:5672
WARNING:qpid.messaging:recoverable error[attempt 0]: [Errno 104] Connection reset by peer
WARNING:qpid.messaging:trying: ophelia.activesoft.ro:5672
WARNING:qpid.messaging:recoverable error[attempt 1]: [Errno 111] Connection refused
WARNING:qpid.messaging:sleeping 1 seconds
WARNING:qpid.messaging:trying: ophelia.activesoft.ro:5672
WARNING:qpid.messaging:recoverable error[attempt 2]: [Errno 111] Connection refused
WARNING:qpid.messaging:sleeping 1 seconds
WARNING:qpid.messaging:trying: ophelia.activesoft.ro:5672
WARNING:qpid.messaging:recoverable error[attempt 3]: [Errno 111] Connection refused
WARNING:qpid.messaging:sleeping 1 seconds
WARNING:qpid.messaging:trying: ophelia.activesoft.ro:5672
WARNING:qpid.messaging:recoverable error[attempt 4]: [Errno 111] Connection refused
WARNING:qpid.messaging:sleeping 1 seconds
WARNING:qpid.messaging:trying: ophelia.activesoft.ro:5672
WARNING:qpid.messaging:recoverable error[attempt 5]: [Errno 111] Connection refused
WARNING:qpid.messaging:sleeping 1 seconds
WARNING:qpid.messaging:trying: ophelia.activesoft.ro:5672
<<gives up here after 5 attempts>>
INFO:root:Error
INFO:root:Error
INFO:root:Error
<<daemon has been started>>
INFO:root:Error
INFO:root:Error

I don't know how exactly it gets a reconnect succeeded right after the daemon is killed, but
it seems to reconnect 5 times, afterwards it will fail with some exception. However, it makes
no more attempts to establish a connection when this state is reached. Now I'm not entirely
sure about the philosophy behind the API, but wouldn't it be more useful to try and reconnect
if down whenever a new message is sent? Right now the default behaviour is to block until
the connection is back up; Tweaking the reconnect parameters will yield a call that throws
an exception after a while, but that invalidates the entire connection>session>sender
chain

This behaviour can be duplicated by kill -9, because the kernel will shutdown and close the
TCP sockets associated with any killed processes. However, an even greater problem arises
when disconnecting the interface (I just pulled the ether cable out of my test box, but service
network stop should probably do the same thing). This way, the socket never really closes
because it's waiting for the other end to shutdown(), which may never come because the other
end might have already given up on the connection (via a hard reset for example). Since most
linux TCP stacks are EXTREMELY patient with regards to socket timeouts, it's up to the heartbeat
mechanism to detect a connection failure and switch to the same state as a shutdown/close
would.

However, what happens is something like this:

INFO:root:Sent 0
INFO:root:Sent 1
INFO:root:Sent 2
INFO:root:Sent 3
INFO:root:Sent 4
WARNING:qpid.messaging:recoverable error[attempt 0]: heartbeat timeout

At this point, the python process enters a busy loop of some sort from which it will never
recover (plugging the ether cable back in doesn't seem to do anything). Switching logging
to debug doesn't output anything past the heartbeat timeout.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:dev-subscribe@qpid.apache.org


Mime
View raw message