qpid-dev mailing list archives

From "michael goulish (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DISPATCH-106) pn link corruption after router restart
Date Tue, 03 Feb 2015 13:10:34 GMT

    [ https://issues.apache.org/jira/browse/DISPATCH-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303225#comment-14303225 ]

michael goulish commented on DISPATCH-106:
------------------------------------------

In server.c, the function thread_run() has this code:

            if (qdpn_connector_failed(cxtr))
                qdpn_connector_close(cxtr);
            else
                work_done = process_connector(qd_server, cxtr);

With the "else" removed, so that process_connector() runs even when the connector has failed,
my test got to 148 iterations before failing, and the crash it eventually hit is very different
from the one I had been seeing.
Before this change, the test almost always failed no later than iteration 3.  So this bug is fixed.

Why:

When the connector has failed, there are still some events on it that need to be processed.
Processing them is what cleans up the links associated with that connection.
If this final processing is skipped on the dead connector, the dispatch code is left holding
dead links that point to memory that will (usually) get freed by proton.  Boom.



> pn link corruption after router restart
> ---------------------------------------
>
>                 Key: DISPATCH-106
>                 URL: https://issues.apache.org/jira/browse/DISPATCH-106
>             Project: Qpid Dispatch
>          Issue Type: Bug
>          Components: Router Node
>    Affects Versions: 0.3
>            Reporter: michael goulish
>             Fix For: 0.4
>
>
> With the standard 6-node demo network (A-D, X, Y), after killing and restarting node Y, I see a bad link on router D, which causes D to crash.
> Here is the sequence of events from the logs of the routers and the topologist testing program:
>   01:05:05.367 Killing router Y, pid 20074
>   01:05:05.367 Sleeping 30 seconds
>   01:05:35.367 Restarting router Y, pid 20120
>   01:05:38     Router D : last "valid origins" post to its log file :
>                Node QDR.C valid origins: []
>   01:05:46     Router D posts to its log file:
>                Exited Router Flux Mode
>   01:06:05.368 checking for crash after node bounce
>                ( no crash detected )
>   01:06:17     last post to router D log file
>                ROUTER_LS (trace) RCVD: RA(id=QDR.X area=0 inst=1422165872 ls_seq=2 mobile_seq=0)
>   01:06:35.369 second check for crash. (none detected)
>   01:06:35.370 getting topology
>                ( Node D fails to respond.  PID 20072 )
>                ( core file, timestamped 01:06 )
>   Here is the backtrace from router D's core file:
>   {
>     #0  pn_string_get (string=0xfdfdfdfdbabecafe) at /home/mick/rh-qpid-proton/proton-c/src/object/string.c:120
>     #1  0x00007ff73fa8e752 in qd_router_link_name (link=0x7ff72800b2d0) at /home/mick/dispatch/src/router_agent.c:112
>     #2  0x00007ff73fa8e7dd in qd_entity_refresh_router_link (entity=0x7ff7300c9b50, impl=0x7ff72800b2d0)
>         at /home/mick/dispatch/src/router_agent.c:120
>     #3  0x0000003e40805d8c in ffi_call_unix64 () from /lib64/libffi.so.6
>     #4  0x0000003e408056bc in ffi_call () from /lib64/libffi.so.6
>     #5  0x00007ff737d2dc8b in _ctypes_callproc () from /usr/lib64/python2.7/lib-dynload/_ctypes.so
>     #6  0x00007ff737d27a85 in PyCFuncPtr_call () from /usr/lib64/python2.7/lib-dynload/_ctypes.so
>     #7  0x00000036df44a0d3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
>     #8  0x00000036df4de37c in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
>     #9  0x00000036df4e21dd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
>     #10 0x00000036df4e088f in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
>     #11 0x00000036df4e21dd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
>     #12 0x00000036df4e088f in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
>     #13 0x00000036df4e21dd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
>     #14 0x00000036df46f0d8 in ?? () from /lib64/libpython2.7.so.1.0
>     #15 0x00000036df44a0d3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
>     #16 0x00000036df4590c5 in ?? () from /lib64/libpython2.7.so.1.0
>     #17 0x00000036df44a0d3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
>     #18 0x00000036df44a1b5 in ?? () from /lib64/libpython2.7.so.1.0
>     #19 0x00000036df44a29e in PyObject_CallFunction () from /lib64/libpython2.7.so.1.0
>     #20 0x00007ff73fa8d77f in qd_io_rx_handler (context=0x7ff736321e68, msg=0x7ff728019bd0, link_id=0
>         at /home/mick/dispatch/src/python_embedded.c:519
>     #21 0x00007ff73fa92533 in router_rx_handler (context=0x1db5fd0, link=0x7ff730008710, delivery=0x7ff73004cc50)
>         at /home/mick/dispatch/src/router_node.c:922
>     #22 0x00007ff73fa7fa16 in do_receive (pnd=0x1e359a0) at /home/mick/dispatch/src/container.c:221
>     #23 0x00007ff73fa7fea3 in process_handler (container=0x1dbd6f0, unused=0x1e0a050, qd_conn=0x1e2c6a0)
>         at /home/mick/dispatch/src/container.c:362
>     #24 0x00007ff73fa80135 in handler (handler_context=0x1dbd6f0, conn_context=0x1e0a050, event=QD_CONN_EVENT_PROCESS,
>         qd_conn=0x1e2c6a0) at /home/mick/dispatch/src/container.c:438
>     #25 0x00007ff73fa98346 in process_connector (qd_server=0x1d78460, cxtr=0x1e1b9b0)
>         at /home/mick/dispatch/src/server.c:322
>     #26 0x00007ff73fa98c1f in thread_run (arg=0x1d70d30) at /home/mick/dispatch/src/server.c:546
>     #27 0x0000003e3dc07ee5 in start_thread () from /lib64/libpthread.so.0
> ...
> }
>   Let's go up to qd_router_link_name
>   at /home/mick/dispatch/src/router_agent.c:112
>   (gdb) print * link
>         $1 =
>         {
>           prev = 0x7ff72800b210,
>           next = 0x7ff72800b390,
>           mask_bit = 3,
>           link_type = QD_LINK_ROUTER,
>           link_direction = QD_OUTGOING,
>           owning_addr = 0x1d7d6c0,
>           waypoint = 0x0,
>           link = 0x7ff7280099d0,
>           connected_link = 0x0,
>           ref = 0x7ff72800f350,
>           target = 0x0,
>           event_fifo =
>           {
>             head = 0x0,
>             tail = 0x0,
>             scratch = 0x0,
>             size = 0
>           },
>           msg_fifo =
>           {
>             head = 0x7ff73003c230,
>             tail = 0x7ff73003bb70,
>             scratch = 0x7ff73003b9f0,
>             size = 102
>           }
>         }
>   (gdb) print * (link->link)
>         $2 =
>         {
>           pn_sess = 0x7ff72804b7b0,
>           pn_link = 0x7ff72804d6a0,
>           context = 0x7ff72800b2d0,
>           node = 0x1db6bb0,
>           drain_mode = false
>         }
>   (gdb) print * (link->link->pn_link)
> $3 = {
>   endpoint = {
>     type = 33686018,
>     state = 33686018,
>     error = 0x202020202020202,
>     condition = {
>       name = 0x202020202020202,
>       description = 0x202020202020202,
>       info = 0x202020202020202
>     },
>     remote_condition = {
>       name = 0x202020202020202,
>       description = 0x202020202020202,
>       info = 0x202020202020202
>     },
>     endpoint_next = 0x202020202020202,
>     endpoint_prev = 0x202020202020202,
>     transport_next = 0x202020202020202,
>     transport_prev = 0x202020202020202,
>     modified = 2,
>     freed = 2,
>     posted_final = 2
>   },
>   source = {
>     address = 0x202020202020202,
>     properties = 0x202020202020202,
>     capabilities = 0x202020202020202,
>     outcomes = 0x202020202020202,
>     filter = 0x202020202020202,
>     durability = (PN_DELIVERIES | unknown: 33686016),
>     expiry_policy = 33686018,
>     timeout = 33686018,
>     type = 33686018,
>     distribution_mode = (PN_DIST_MODE_MOVE | unknown: 33686016),
>     dynamic = 2
>   },
>   target = {
>     address = 0x202020202020202,
>     properties = 0x202020202020202,
>     capabilities = 0x202020202020202,
>     outcomes = 0x202020202020202,
>     filter = 0x202020202020202,
>     durability = (PN_DELIVERIES | unknown: 33686016),
>     expiry_policy = 33686018,
> ( etc.  -- it's all garbage. )


