ignite-dev mailing list archives

From "Igor Kamyshnikov (JIRA)" <j...@apache.org>
Subject [jira] [Created] (IGNITE-10469) TcpCommunicationSpi does not break tcp connection after IdleConnectionTimeout seconds of inactivity
Date Thu, 29 Nov 2018 12:23:00 GMT
Igor Kamyshnikov created IGNITE-10469:
-----------------------------------------

             Summary: TcpCommunicationSpi does not break tcp connection after IdleConnectionTimeout
seconds of inactivity
                 Key: IGNITE-10469
                 URL: https://issues.apache.org/jira/browse/IGNITE-10469
             Project: Ignite
          Issue Type: Bug
          Components: cache
    Affects Versions: 2.6, 2.5
            Reporter: Igor Kamyshnikov
         Attachments: GridTcpCommunicationSpiIdleCommunicationTimeoutTest.java, ignite_idle_test.zip

TcpCommunicationSpi does not close TCP connections after they have been idle for longer than
the time configured in TcpCommunicationSpi#idleConnTimeout (default is 10 minutes).

There are environments where idle TCP connections become unusable: connections remain ESTABLISHED
while the data to be sent piles up in Send-Q (according to netstat). For this reason the Ignite
stack does not recognize a communication problem for a considerable amount of time (~ 10-15
minutes), and it does not begin its reconnection procedure (heartbeats use different TCP connections
that are not idle and don't have this issue).

I've discovered, though, that there is logic in the Ignite code to detect and close idle connections,
but due to a problem in the code it does not work reliably.

This is a test that _sometimes_ reproduces the problem.
[^ignite_idle_test.zip] - full test project
[^GridTcpCommunicationSpiIdleCommunicationTimeoutTest.java] - just test code

What's the problem in the Ignite code?

There are two loops in the Ignite code that have a chance to close idle connections:
1) org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.CommunicationWorker#processIdle
- this one is executed every *IdleConnectionTimeout* milliseconds. (It can close idle connections,
but it typically concludes that the connection is not idle, because the second loop has already
reset the idle timer.)
2) org.apache.ignite.internal.util.nio.GridNioServer.AbstractNioClientWorker#bodyInternal
-> org.apache.ignite.internal.util.nio.GridNioServer.AbstractNioClientWorker#checkIdle
- this loop executes:
{noformat}
filterChain.onSessionIdleTimeout(ses); <-- does not actually close an idle connection
// Update timestamp to avoid multiple notifications within one timeout interval.
ses.resetSendScheduleTime(); <--- resets idle timer
ses.bytesReceived(0);
{noformat}
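
To illustrate how the two loops can interfere, here is a minimal, single-threaded sketch. It is not the Ignite code: the Session class and the method names nioCheckIdle, communicationProcessIdle and simulate are hypothetical stand-ins, and the real loops run on separate threads with their own timing. It only models the ordering effect described above, which may also explain why the test reproduces the problem only sometimes.

```java
public class IdleTimerResetSketch {
    /** Hypothetical stand-in for a NIO session: tracks only the last-activity time. */
    static final class Session {
        private long lastActivity;
        private boolean closed;

        Session(long now) { lastActivity = now; }

        long idleTime(long now) { return now - lastActivity; }
        void resetIdleTimer(long now) { lastActivity = now; }
        void close() { closed = true; }
        boolean isClosed() { return closed; }
    }

    /** Models loop 2 (checkIdle): notifies about idleness, then resets the timer without closing. */
    static void nioCheckIdle(Session ses, long now, long idleTimeout) {
        if (ses.idleTime(now) >= idleTimeout) {
            // Assumption: the idle notification itself does not close the connection.
            ses.resetIdleTimer(now);
        }
    }

    /** Models loop 1 (processIdle): closes the session only if it still looks idle. */
    static void communicationProcessIdle(Session ses, long now, long idleTimeout) {
        if (!ses.isClosed() && ses.idleTime(now) >= idleTimeout)
            ses.close();
    }

    /** Runs both loops once per timeout interval, in the given order; returns whether the session got closed. */
    static boolean simulate(boolean nioLoopRunsFirst, long idleTimeout, int rounds) {
        Session ses = new Session(0L);

        for (int i = 1; i <= rounds; i++) {
            long now = i * idleTimeout;

            if (nioLoopRunsFirst) {
                nioCheckIdle(ses, now, idleTimeout);
                communicationProcessIdle(ses, now, idleTimeout);
            }
            else {
                communicationProcessIdle(ses, now, idleTimeout);
                nioCheckIdle(ses, now, idleTimeout);
            }
        }

        return ses.isClosed();
    }

    public static void main(String[] args) {
        // The NIO loop wins the race: the timer is reset before the worker looks at it.
        System.out.println("NIO loop first: closed=" + simulate(true, 600_000L, 5));    // closed=false
        // The worker wins the race: it sees the idle session and closes it.
        System.out.println("Worker loop first: closed=" + simulate(false, 600_000L, 5)); // closed=true
    }
}
```

In this model the idle connection is closed only when the processIdle-style check happens to run before the timer reset, so whether the connection gets closed depends on scheduling.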

---
To wrap up, maybe the whole approach should be reviewed:
 - is it ok not to track message delivery time?
 - is it ok not to do heartbeating using the same connections as for get/put/... commands?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)