trafodion-codereview mailing list archives

From zcorrea <>
Subject [GitHub] incubator-trafodion pull request #1234: [TRAFODION-2746] Fixed various probl...
Date Sat, 16 Sep 2017 00:06:39 GMT
GitHub user zcorrea opened a pull request:

    [TRAFODION-2746] Fixed various problem detected in large clusters (> 30)

    The problems were:
    1. A segmentation violation occurred during the Integration phase, when the new 
       monitor is establishing the socket communication paths between itself and 
       the existing monitors.
       a. Information is exchanged between the master (creator) monitor and the
          slave (new) monitor process, telling the new monitor which nodes'
          monitor processes make up the existing cluster instance. During these
          exchanges, one of the messages in CCluster::ReceiveSock() was large
          enough to require chunking, and the logic that tracked the number of
          bytes received computed the count incorrectly, which resulted in a
          write past the end of the receive buffer.
    2. A second segmentation violation was due to a buffer overwrite during the
       Joining (revive) phase.
       a. In requeue.cxx, the master (creator) monitor creates a buffer and
          populates it with the cluster state information to be sent to the
          slave (new) monitor process. The size calculation for that buffer did
          not properly account for the number of logical and physical nodes, so
          populating the buffer wrote past the end of the allocation.
    3. A third problem was also noted, in which one of the monitors would remain
       in the Joining state and never come out of it.
       a. The problem was the order of operations around the call to
          CCluster::ResetIntegratingPNid(), which triggers
          CCommAccept::commAcceptorSock() to accept another new node to
          integrate. ResetIntegratingPNid() was invoked before the creator
          flag was reset. Due to kernel scheduling, the creator flag was
          sometimes reset after another monitor had already started the
          Integration phase, which broke the node integration protocol by
          terminating it too early. The new monitor would then stay in the
          Joining state forever.
    4. The last segmentation violation was due to a stderr buffer overwrite in
       CRedirectStderr::handleOutput(), where the size returned by snprintf()
       was used to terminate the buffer. When the stderr data was >= 4096
       bytes, which is the size of the buffer, the terminator was written
       out of bounds.
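
The pattern behind problem 4 can be illustrated with a minimal sketch. This is not the code in CRedirectStderr::handleOutput(); the helper name `formatClamped` and its parameters are hypothetical. It relies only on the standard behavior of snprintf(), which returns the length the complete output would have had. On truncation that value is greater than or equal to the buffer size, so it has to be clamped before it is used as an index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not the monitor's code): copy `data` into `buf` and
// return the number of characters actually stored, excluding the terminator.
std::size_t formatClamped(char *buf, std::size_t bufSize, const std::string &data)
{
    assert(buf != nullptr);
    if (bufSize == 0) {
        return 0;                       // nothing can be stored
    }

    // snprintf() writes at most bufSize - 1 characters plus a terminator, but
    // returns the length the full output WOULD have had (negative on error).
    int wanted = std::snprintf(buf, bufSize, "%s", data.c_str());
    if (wanted < 0) {
        buf[0] = '\0';
        return 0;
    }

    std::size_t stored = static_cast<std::size_t>(wanted);
    if (stored >= bufSize) {
        stored = bufSize - 1;           // output was truncated: clamp the index
    }
    buf[stored] = '\0';                 // always inside the buffer
    return stored;
}
```

Using the unclamped return value as the terminator index would write outside the buffer whenever the input is at least as long as the buffer, which matches the symptom described above.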

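For problem 1, the following sketch shows one way to track the number of bytes received when a message arrives in several chunks. It is not the logic in CCluster::ReceiveSock(); `ReadFn` and `receiveAll` are hypothetical names, and the socket read is abstracted as a callable so the loop can be exercised without a network. The point is that each read targets the current offset and asks only for the bytes still missing, so the running total can never exceed the expected size.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical stand-in for a socket read: copies up to `len` bytes into
// `dst` and returns the number delivered (0 = peer closed, < 0 = error).
using ReadFn = std::function<long(char *dst, std::size_t len)>;

// Receive exactly `size` bytes into `buf`, whose capacity is `bufCap`,
// tolerating short reads. Returns true only if all bytes arrived.
bool receiveAll(const ReadFn &readSome, char *buf, std::size_t bufCap,
                std::size_t size)
{
    assert(buf != nullptr);
    if (size > bufCap) {
        return false;                   // message would not fit in the buffer
    }

    std::size_t received = 0;
    while (received < size) {
        // Read at the current offset and request only the remaining bytes.
        long n = readSome(buf + received, size - received);
        if (n <= 0) {
            return false;               // connection closed or read error
        }
        received += static_cast<std::size_t>(n);
    }
    return true;
}
```

With a real socket, `readSome` would wrap a call such as recv() on the connected descriptor; that wiring is omitted here because it depends on the monitor's own socket handling.
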
You can merge this pull request into a Git repository by running:

    $ git pull TRAFODION-2746

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1234
commit 19555630d5c0d63e8a8ea1e02f92545da983cb35
Author: Zalo Correa <>
Date:   2017-09-16T00:02:48Z

    [TRAFODION-2746] Fixed various problem detected in large clusters (> 30)


