hawq-dev mailing list archives

From Paul Guo <paul...@gmail.com>
Subject Check segment node status in QD polling.
Date Mon, 06 Feb 2017 11:46:03 GMT
I encountered a QD timeout issue recently. This happens when:

1) One of the segment nodes panics and then restarts.
2) The other segments hang in the interconnect (they retry until timeout).

The QD stays in its poll() loop until either a QE reports an error after
the interconnect timeout, or libpq (QD<->QE) reports an error after a
timeout, since the socket is configured with kernel TCP keepalive. This is
bad because the default timeouts of both detection mechanisms are long
(1 hour and 2 hours on my test systems). Although we could modify the
default values, I'm wondering if we could have a better and more
controllable solution: use the RM heartbeat mechanism.

The RM maintains a global list of IDs for all nodes (stable across node
addition and removal) and keeps updating their health state via a userspace
heartbeat mechanism. We could therefore maintain a bitmap in shared memory
that holds the latest node health info, and use it in the QD code, i.e.
cancel the query if a segment node that handles part of the query is found
to be down.
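
To make the idea concrete, here is a minimal sketch in C of such a health
bitmap and the check the QD could run against it. Everything in it is
hypothetical and not existing HAWQ code: the names SegHealthMap,
seg_health_set, seg_health_is_up and first_down_segment, and the
MAX_SEGMENTS bound. It also assumes the RM's global node IDs are small
non-negative integers, and it leaves out shared memory allocation and
locking/atomics, which a real implementation would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical upper bound on the RM's global node IDs. */
#define MAX_SEGMENTS 1024

/* One bit per node ID: 1 = healthy, 0 = down or unknown.
 * In the proposal this struct would live in shared memory. */
typedef struct SegHealthMap
{
    uint64_t bits[MAX_SEGMENTS / 64];
} SegHealthMap;

/* Mark every node as down/unknown. */
static void seg_health_init(SegHealthMap *map)
{
    memset(map->bits, 0, sizeof(map->bits));
}

/* Would be called by the RM when a heartbeat changes a node's state. */
static void seg_health_set(SegHealthMap *map, int node_id, bool healthy)
{
    assert(node_id >= 0 && node_id < MAX_SEGMENTS);
    uint64_t mask = (uint64_t) 1 << (node_id % 64);

    if (healthy)
        map->bits[node_id / 64] |= mask;
    else
        map->bits[node_id / 64] &= ~mask;
}

/* Would be called by the QD to read the latest known state of one node. */
static bool seg_health_is_up(const SegHealthMap *map, int node_id)
{
    assert(node_id >= 0 && node_id < MAX_SEGMENTS);
    return (map->bits[node_id / 64] >> (node_id % 64)) & 1;
}

/* Return the ID of the first node used by the query that is marked
 * down, or -1 if all of them are marked healthy. */
static int first_down_segment(const SegHealthMap *map,
                              const int *query_node_ids, int count)
{
    for (int i = 0; i < count; i++)
    {
        if (!seg_health_is_up(map, query_node_ids[i]))
            return query_node_ids[i];
    }
    return -1;
}
```

The QD could call first_down_segment() each time poll() returns on a short
timeout and cancel the query when the result is not -1, so the detection
latency would be bounded by the heartbeat interval rather than by the
interconnect or keepalive timeouts.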

Any ideas? Thanks.
