cassandra-commits mailing list archives

From "Jackson Chung (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-4740) Phantom TCP connections, failing hinted handoff
Date Thu, 22 Nov 2012 02:00:58 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502535#comment-13502535 ]

Jackson Chung commented on CASSANDRA-4740:
------------------------------------------

We see something similar:

On box 192.168.13.56, counting all ESTABLISHED connections from other nodes to this box's port 7000:
{panel}
$ netstat -ant | grep "192.168.13.56:7000.*EST" | cut -d ':' -f 1-2 | sort | uniq -c
      1 tcp        0      0 192.168.13.56:7000          192.168.12.13
      2 tcp        0      0 192.168.13.56:7000          192.168.14.145
    217 tcp        0      0 192.168.13.56:7000          192.168.44.237
    202 tcp        0      0 192.168.13.56:7000          192.168.45.67
    198 tcp        0      0 192.168.13.56:7000          192.168.46.141
     11 tcp        0      0 192.168.13.56:7000          192.168.76.156
     10 tcp        0      0 192.168.13.56:7000          192.168.77.72
     11 tcp        0      0 192.168.13.56:7000          192.168.78.153
{panel}
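
The pipeline above can also be written as a small shell function. This is only a sketch: `count_peers` is a hypothetical helper name, and it assumes the Linux `netstat -ant` column layout shown in the output above (local address in column 4, foreign address in column 5, state in column 6).

```shell
# count_peers LOCAL_ADDR_PORT   (hypothetical helper, not part of Cassandra)
# Reads `netstat -ant` output on stdin and prints "count remote_ip" for every
# remote host holding ESTABLISHED connections to the given local address:port,
# highest count first.
count_peers() {
    awk -v local="$1" '
        $4 == local && $6 == "ESTABLISHED" {
            remote = $5
            sub(/:[0-9]+$/, "", remote)   # drop the remote port, keep the IP
            count[remote]++
        }
        END { for (ip in count) print count[ip], ip }
    ' | sort -rn
}
```

It would be used as `netstat -ant | count_peers 192.168.13.56:7000`. Matching on exact columns rather than a `grep` pattern avoids counting lines where the address appears as the foreign end instead of the local one.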

On 192.168.44.237, netstat shows just 1 ESTABLISHED connection to 192.168.13.56:7000:
{panel}
$ sudo netstat -antp | grep "192.168.44.237.*192.168.13.56:7000" 
tcp        0      0 192.168.44.237:35252        192.168.13.56:7000          ESTABLISHED 14398/java
{panel}

We also have a hinted handoff (HH) problem similar to the one described in this issue (though
the logs on the two nodes above don't show the timeouts happening for these two nodes). We
also have nodes flapping. It also turned out that the firewall rules on some nodes did not
allow communication with all nodes on port 7000.

Restarting the node fixes the issue.

version:
{panel}
$ uname -a
Linux kca06apigee 3.2.21-1.32.6.amzn1.x86_64 #1 SMP Sat Jun 23 02:32:15 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

$ /usr/java/latest/bin/java -version
java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)
{panel}

How can netstat on one box show 200+ ESTABLISHED connections with the other box while the
other box only shows 1?
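
One way to dig into a mismatch like this is to save the netstat output from both boxes and list the connections that exist on only one side. The sketch below is an illustration, not something taken from the report: `phantom_conns` and the file names are hypothetical, and it assumes the Linux `netstat -tn` column layout and no NAT between the nodes, so that each address looks the same from both ends.

```shell
# phantom_conns LOCAL_FILE REMOTE_FILE   (hypothetical helper)
# Both files hold saved `netstat -tn` output, one from each node.
# Prints every ESTABLISHED connection in LOCAL_FILE that has no mirrored
# entry (local and foreign address swapped) in REMOTE_FILE.
phantom_conns() {
    awk '
        $6 != "ESTABLISHED" { next }
        # First file on the awk command line is the remote side: index it mirrored.
        FILENAME == ARGV[1] { seen[$5 " " $4] = 1; next }
        # Second file is the local side: report anything the remote side lacks.
        !(($4 " " $5) in seen) { print $4, $5 }
    ' "$2" "$1"
}
```

For example, after running `netstat -tn > node13_56.txt` on 192.168.13.56 and `netstat -tn > node44_237.txt` on 192.168.44.237, `phantom_conns node13_56.txt node44_237.txt` would list the connections that only 192.168.13.56 still considers established.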
                
> Phantom TCP connections, failing hinted handoff
> -----------------------------------------------
>
>                 Key: CASSANDRA-4740
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4740
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.1.2
>         Environment: Linux 3.4.9, java 1.6.0_35-b10
>            Reporter: Mina Naguib
>            Priority: Minor
>              Labels: connection, handoff, hinted, orphan, phantom, tcp, zombie
>         Attachments: write_latency.png
>
>
> IP addresses in report anonymized:
> Had a server running cassandra (1.1.1.10) reboot ungracefully.  Reboot and startup was successful and uneventful.  cassandra went back into service ok.
> From that point onwards however, several (but not all) machines in the cassandra cluster started having difficulty with hinted handoff to that machine.  This was despite nodetool ring showing Up across the board.
> Here's an example of an attempt, every 10 minutes, by a node (1.1.1.11) to replay hints to the node that was rebooted:
> {code}
> INFO [HintedHandoff:1] 2012-10-01 11:07:23,293 HintedHandOffManager.java (line 294) Started hinted handoff for token: 122879743610338889583996386017027409691 with IP: /1.1.1.10
> INFO [HintedHandoff:1] 2012-10-01 11:07:33,295 HintedHandOffManager.java (line 372) Timed out replaying hints to /1.1.1.10; aborting further deliveries
> INFO [HintedHandoff:1] 2012-10-01 11:07:33,295 HintedHandOffManager.java (line 390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10
> INFO [HintedHandoff:1] 2012-10-01 11:17:23,312 HintedHandOffManager.java (line 294) Started hinted handoff for token: 122879743610338889583996386017027409691 with IP: /1.1.1.10
> INFO [HintedHandoff:1] 2012-10-01 11:17:33,319 HintedHandOffManager.java (line 372) Timed out replaying hints to /1.1.1.10; aborting further deliveries
> INFO [HintedHandoff:1] 2012-10-01 11:17:33,319 HintedHandOffManager.java (line 390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10
> INFO [HintedHandoff:1] 2012-10-01 11:27:23,335 HintedHandOffManager.java (line 294) Started hinted handoff for token: 122879743610338889583996386017027409691 with IP: /1.1.1.10
> INFO [HintedHandoff:1] 2012-10-01 11:27:33,337 HintedHandOffManager.java (line 372) Timed out replaying hints to /1.1.1.10; aborting further deliveries
> INFO [HintedHandoff:1] 2012-10-01 11:27:33,337 HintedHandOffManager.java (line 390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10
> INFO [HintedHandoff:1] 2012-10-01 11:37:23,357 HintedHandOffManager.java (line 294) Started hinted handoff for token: 122879743610338889583996386017027409691 with IP: /1.1.1.10
> INFO [HintedHandoff:1] 2012-10-01 11:37:33,358 HintedHandOffManager.java (line 372) Timed out replaying hints to /1.1.1.10; aborting further deliveries
> INFO [HintedHandoff:1] 2012-10-01 11:37:33,359 HintedHandOffManager.java (line 390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10
> INFO [HintedHandoff:1] 2012-10-01 11:47:23,412 HintedHandOffManager.java (line 294) Started hinted handoff for token: 122879743610338889583996386017027409691 with IP: /1.1.1.10
> INFO [HintedHandoff:1] 2012-10-01 11:47:33,414 HintedHandOffManager.java (line 372) Timed out replaying hints to /1.1.1.10; aborting further deliveries
> INFO [HintedHandoff:1] 2012-10-01 11:47:33,414 HintedHandOffManager.java (line 390) Finished hinted handoff of 0 rows to endpoint /1.1.1.10
> {code}
> I started poking around, and discovered that several nodes held ESTABLISHED TCP connections that didn't have a live endpoint on the rebooted node.  My guess is they were live prior to the reboot, and after the reboot the nodes still see them as live and unsuccessfully try to use them.
> Example, on the node that was rebooted:
> {code}
> .10 ~ # netstat -tn | grep 1.1.1.11
> tcp        0      0 1.1.1.10:7000        1.1.1.11:40960        ESTABLISHED
> tcp        0      0 1.1.1.10:34370       1.1.1.11:7000         ESTABLISHED
> tcp        0      0 1.1.1.10:45518       1.1.1.11:7000         ESTABLISHED
> {code}
> While on the node that's failing to hint to it:
> {code}
> .11 ~ # netstat -tn | grep 1.1.1.10
> tcp        0      0 1.1.1.11:7000         1.1.1.10:34370       ESTABLISHED
> tcp        0      0 1.1.1.11:7000         1.1.1.10:45518       ESTABLISHED
> tcp        0      0 1.1.1.11:7000         1.1.1.10:53316       ESTABLISHED
> tcp        0      0 1.1.1.11:7000         1.1.1.10:43239       ESTABLISHED
> tcp        0      0 1.1.1.11:40960        1.1.1.10:7000        ESTABLISHED
> {code}
> Notice the phantom connections on :53316 and :43239 which do not appear on the remote 1.1.1.10.
> On .11 I tried disabling and enabling gossip, but that did not restart :7000 nor clean up the 2 phantom connections.  For good measure I also tried disabling and enabling thrift (long shot), and that didn't help either.
> The only thing that helped was to actually stop and start cassandra, in a rolling fashion, on each node that was having trouble hinting to the machine that was rebooted.  The phantom connections naturally disappeared, write volume on 1.1.1.10 rose for a while, and all the hints were sent successfully.
> I'm unsure whether the phantom TCP connections are a cause, or just loosely correlated, to the hinted handoff failure every 10 minutes.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
