cassandra-commits mailing list archives

From "Paulo Motta (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-11286) streaming socket never times out
Date Tue, 08 Mar 2016 19:24:40 GMT


Paulo Motta commented on CASSANDRA-11286:

There is a relatively common scenario in which the receiver receives a {{COMPLETE}} message
from the sender and changes its state to {{WAIT_COMPLETE}}, after which the {{IncomingMessageHandler}}
blocks on the socket again. Meanwhile, the {{OnCompletionRunnable}} finishes processing
the received stream and closes the session via {{maybeCompleted()}}. Even though the session
is terminated correctly on both sides, once {{streaming_socket_timeout_in_ms}} elapses the user
may get a harmless but scary {{StreamSocketTimeout}} log message from the previously blocked socket
(I got this in dtests a few times). To fix this, I updated the patch to close the {{OutgoingStreamHandler}}
after sending the last {{COMPLETE}} message and the {{IncomingStreamHandler}} after receiving
the last {{COMPLETE}} message, so that all resources are properly released.

Below is an example of this scenario:
DEBUG [STREAM-IN-/] 2016-03-08 15:31:18,632 - [Stream #f19e82b0-e55b-11e5-9a50-f59bd42ef741] Received Complete
DEBUG [STREAM-OUT-/] 2016-03-08 15:31:19,004 - [Stream #f19e82b0-e55b-11e5-9a50-f59bd42ef741] Sending Complete
DEBUG [StreamReceiveTask:1] 2016-03-08 15:31:19,004 - [Stream #f19e82b0-e55b-11e5-9a50-f59bd42ef741] Closing stream connection handler on /
INFO  [StreamReceiveTask:1] 2016-03-08 15:31:19,005 - [Stream #f19e82b0-e55b-11e5-9a50-f59bd42ef741] Session with / is complete
INFO  [StreamReceiveTask:1] 2016-03-08 15:31:19,025 - [Stream #f19e82b0-e55b-11e5-9a50-f59bd42ef741] All sessions completed

... {{streaming_socket_timeout_in_ms}} passes ...

ERROR [STREAM-IN-/] 2016-03-08 15:31:19,638 - [Stream #f19e82b0-e55b-11e5-9a50-f59bd42ef741] Streaming error occurred null
        at$ ~[na:1.8.0_66]
        at ~[na:1.8.0_66]
        at java.nio.channels.Channels$ ~[na:1.8.0_66]
        at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(
        at org.apache.cassandra.streaming.ConnectionHandler$
        at [na:1.8.0_66]
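
The idea behind the fix described above (stop reading and release the connection once the final {{COMPLETE}} arrives, instead of blocking on the socket again) can be illustrated with a minimal sketch. This is not the Cassandra patch: the class name, the one-byte message codes, and the {{receiveUntilComplete}} helper are all hypothetical and exist only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class CompleteClosesHandlerSketch {
    // Hypothetical one-byte message type codes, not Cassandra's wire format.
    static final int FILE = 1;
    static final int COMPLETE = 2;

    /**
     * Reads one-byte messages until COMPLETE (or end of stream), then closes
     * the connection. Returns the number of messages handled before COMPLETE.
     */
    static int receiveUntilComplete(InputStream in, Closeable connection) {
        int handled = 0;
        try {
            try {
                int type;
                while ((type = in.read()) != -1) {
                    if (type == COMPLETE) {
                        break; // leave the loop instead of blocking on another read
                    }
                    handled++;
                }
            } finally {
                connection.close(); // release the connection as soon as the loop ends
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return handled;
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(new byte[] {FILE, FILE, COMPLETE});
        int handled = receiveUntilComplete(in, () -> System.out.println("connection closed"));
        System.out.println("messages handled before COMPLETE: " + handled);
    }
}
```

Because the loop exits on {{COMPLETE}} and the close happens in a {{finally}} block, no reader is left blocked on the socket waiting for a timeout after the session has already finished.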

I also noticed that {{replace_address_test.TestReplaceAddress.replace_with_reset_resume_state_test}}
(CASSANDRA-11246) and {{replace_address_test.TestReplaceAddress.resumable_replace_test}} were
failing for the same reason as {{}}
(CASSANDRA-10912), so I applied the fix from CASSANDRA-10167 (decrease the socket timeout) to
the [dtest PR|], which should prevent these tests from hanging
because of the default {{streaming_socket_timeout_in_ms}} of 1 hour.

I rebased and updated branches and resubmitted tests:

> streaming socket never times out
> --------------------------------
>                 Key: CASSANDRA-11286
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
> While trying to reproduce CASSANDRA-8343 I was not able to trigger a {{SocketTimeoutException}} by adding an artificial sleep longer than {{streaming_socket_timeout_in_ms}}.
> After investigation, I detected two problems:
> * {{ReadableByteChannel}} creation via {{socket.getChannel()}}, as done in {{ConnectionHandler.getReadChannel(socket)}}, does not respect {{socket.setSoTimeout()}}, as explained in this [blog post|]
> ** bq. The only difference between “blocking NIO” and “NIO wrapped around IO” is that you can’t use socket timeout with SocketChannels. Why ? Read a javadoc for setSocketTimeout(). It says that this timeout is used only by streams.
> * {{socketSoTimeout}} is never set on the "follower" side, only on the initiator side via {{DefaultConnectionFactory.createConnection(peer)}}.
> This may cause streaming to hang indefinitely, as exemplified by CASSANDRA-8621:
> bq. For the scenario that prompted this ticket, it appeared that the streaming process was completely stalled. One side of the stream (the sender side) had an exception that appeared to be a connection reset. The receiving side appeared to think that the connection was still active, at least in terms of the netstats reported by nodetool. We were unable to verify whether this was specifically the case in terms of connected sockets due to the fact that there were multiple streams for those peers, and there is no simple way to correlate a specific stream to a tcp session.
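
The first problem in the description above concerns which read path honours {{socket.setSoTimeout()}}. As a minimal sketch, assuming a plain blocking {{java.net.Socket}}, wrapping {{socket.getInputStream()}} with {{Channels.newChannel(...)}} yields a {{ReadableByteChannel}} whose reads go through the stream and therefore observe the timeout. The class and method names below are hypothetical and this is not the code from the patch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

final class TimeoutReadSketch {
    /** Wraps the socket's InputStream so that channel reads honour setSoTimeout(). */
    static ReadableByteChannel timeoutAwareChannel(Socket socket) throws IOException {
        return Channels.newChannel(socket.getInputStream());
    }

    /** Returns true if reading from a silent peer fails with SocketTimeoutException. */
    static boolean readTimesOut(int timeoutMillis) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket peer = server.accept()) { // the accepted peer never writes anything
            client.setSoTimeout(timeoutMillis);
            ReadableByteChannel in = timeoutAwareChannel(client);
            try {
                in.read(ByteBuffer.allocate(16)); // blocks until data arrives or the timeout fires
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

The sketch sets the timeout on the reading socket itself, which also mirrors the second problem: the timeout has to be configured on whichever side performs the blocking read, not only on the side that initiated the connection.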

This message was sent by Atlassian JIRA
