cassandra-commits mailing list archives

From "Adam Horwich (JIRA)" <>
Subject [jira] [Created] (CASSANDRA-8620) Bootstrap session hanging indefinitely
Date Wed, 14 Jan 2015 17:08:35 GMT
Adam Horwich created CASSANDRA-8620:

             Summary: Bootstrap session hanging indefinitely
                 Key: CASSANDRA-8620
             Project: Cassandra
          Issue Type: Bug
         Environment: Debian 7, Oracle JDK 1.7.0_51, AWS + GCE
            Reporter: Adam Horwich

Hi! We have been running a relatively small 2.1.2 cluster across 2 DCs for a few months, with
~100GB load per node and RF=3, and over the last few weeks we have been trying to scale up capacity.

We've recently been seeing scenarios in which the Bootstrap or Unbootstrap streaming process
hangs indefinitely for one or more sessions on the receiver, without any stacktrace or exception.
This does not happen every time, and we do not get into this state with the same sender every
time. When the receiver is in a hung state, the following can be found in the thread dump:

The Stream-IN thread for one or more sessions is blocked in the following state: 
Thread 24942: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=2043 (Compiled frame)
 - java.util.concurrent.ArrayBlockingQueue.take() @bci=20, line=374 (Compiled frame)
 - @bci=31, line=89 (Compiled frame)
 - @bci=4, line=337 (Compiled frame)
 - org.apache.cassandra.utils.BytesReadTracker.readUnsignedShort() @bci=4, line=140 (Compiled frame)
 - org.apache.cassandra.utils.ByteBufferUtil.readShortLength( @bci=1, line=317 (Compiled frame)
 - org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength( @bci=2, line=327 (Compiled frame)
 - org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize( @bci=5, line=397 (Compiled frame)
 - org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize( @bci=2, line=381 (Compiled frame)
 - org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(, org.apache.cassandra.db.ColumnSerializer$Flag, int,$Version) @bci=10, line=75 (Compiled frame)
 - org.apache.cassandra.db.AbstractCell$1.computeNext() @bci=25, line=52 (Compiled frame)
 - org.apache.cassandra.db.AbstractCell$1.computeNext() @bci=1, line=46 (Compiled frame)
 - @bci=9, line=143 (Compiled frame)
 - @bci=61, line=138 (Compiled frame)
 -, org.apache.cassandra.config.CFMetaData,,$Version) @bci=320, line=283 (Compiled frame)
 - org.apache.cassandra.streaming.StreamReader.writeRow(,, org.apache.cassandra.db.ColumnFamilyStore) @bci=26, line=157 (Compiled frame)
 - @bci=258, line=89 (Compiled frame)
 - org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(java.nio.channels.ReadableByteChannel, int, org.apache.cassandra.streaming.StreamSession) @bci=69, line=48 (Interpreted frame)
 - org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(java.nio.channels.ReadableByteChannel, int, org.apache.cassandra.streaming.StreamSession) @bci=4, line=38 (Interpreted frame)
 - org.apache.cassandra.streaming.messages.StreamMessage.deserialize(java.nio.channels.ReadableByteChannel, int, org.apache.cassandra.streaming.StreamSession) @bci=37, line=55 (Interpreted frame)
 - org.apache.cassandra.streaming.ConnectionHandler$ @bci=24, line=245 (Interpreted frame)
 - @bci=11, line=744 (Interpreted frame)
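
To illustrate the pattern at the top of that stack, here is a minimal sketch (our own code,
not the Cassandra source; the class and method names are made up, with dataBuffer modelled on
the field we refer to below): a read() path that parks on a chunk queue fed by the network
thread, and therefore blocks forever if the producer stops supplying chunks.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Minimal sketch (not Cassandra source) of the consumer pattern at the
    // top of the stack: the deserializing thread parks on a chunk queue
    // that is fed by a separate network thread.
    public class ChunkedInput {
        // Hypothetical chunk queue, analogous to the dataBuffer mentioned below.
        private final BlockingQueue<byte[]> dataBuffer = new ArrayBlockingQueue<>(16);
        private byte[] current;
        private int pos;

        // Called by the network thread as chunks arrive off the socket.
        public void supply(byte[] chunk) throws InterruptedException {
            dataBuffer.put(chunk);
        }

        // Called by the deserializing (Stream-IN) thread.
        public int read() throws InterruptedException {
            if (current == null || pos == current.length) {
                current = dataBuffer.take(); // parks indefinitely if no chunk ever arrives
                pos = 0;
            }
            return current[pos++] & 0xFF;
        }
    }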

Debug logging shows that the receiver is still reading the file it is receiving from the sender
and has not yet sent an ACK.

The receiver is waiting for more data to finish writing its row, and the sender is not sending
any more data. There is a large amount of data (~5MB) stuck in the Recv-Q on the receiver and
in the Send-Q on the sender.

We've been trying to diagnose this issue internally, but it's difficult to create a reliably
reproducible scenario. So far we have found that restarting all nodes in the cluster and ensuring
that a cleanup has been performed helps mitigate the problem (though a cleanup without a restart
can still result in a hung state). However, it's unclear to me why either of these would affect
the streaming process in the way we've observed. One theory is that the calculated section
sizes are inaccurate.

We tried setting a timeout on the dataBuffer read (moving take to poll), which forced a
retry of the file, but the same transfer failed again until retries were exhausted.
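
In terms of the sketch above, the change we tried looks roughly like this (readTimeoutMillis
is an assumed parameter for illustration, not an existing Cassandra setting):

    import java.io.EOFException;
    import java.io.IOException;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;

    // Sketch of the take -> poll change: a stalled sender now surfaces as
    // an exception (triggering the retry path) instead of parking forever.
    public class TimedChunkRead {
        static byte[] nextChunk(BlockingQueue<byte[]> dataBuffer, long readTimeoutMillis)
                throws IOException, InterruptedException {
            byte[] chunk = dataBuffer.poll(readTimeoutMillis, TimeUnit.MILLISECONDS);
            if (chunk == null)
                throw new EOFException("Timed out waiting for the next chunk");
            return chunk;
        }
    }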

From a heap dump in the deadlocked state we confirmed the following:

BytesReadTracker.bytesRead = 100477411
Length of the section being read (which happens to be the last section): 100477411

We also wrote all the data in the buffer out to a text file and observed that the data in the
buffer ended halfway through a row.

The stream reader checks the value of bytes read after importing each row (StreamReader line
97) to detect the end of the section; however, it does not expect the data to finish halfway
through a row. This can happen in OnDiskAtom.deserializeFromSSTable in any of the various
deserialization calls.
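
To make that failure mode concrete, here is a simplified paraphrase of the loop being described
(our own sketch, not the exact StreamReader source): the end-of-section check only runs between
rows, so if the incoming bytes end halfway through a row, the thread blocks inside row
deserialization and never re-evaluates the loop condition.

    import;
    import;

    // Simplified paraphrase (not the actual StreamReader code) of the
    // end-of-section check described above.
    public class SectionReadLoop {
        interface BytesCounter { long bytesRead(); }
        interface RowDeserializer { void readRow(DataInput in) throws IOException; }

        static void readSection(DataInput in, BytesCounter counter, long totalSize,
                                RowDeserializer rows) throws IOException {
            while (counter.bytesRead() < totalSize) {
                // If the stream stalls mid-row, this call blocks waiting for the
                // next byte and the loop condition is never checked again.
                rows.readRow(in);
            }
        }
    }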

