hbase-user mailing list archives

From Galed Friedmann <galed.friedm...@onavo.com>
Subject Thrift "hang ups" with no apparent reason
Date Mon, 30 Jan 2012 14:39:06 GMT
Hi,
I have an HBase cluster which consists of 1 master server (running the
NameNode, ZooKeeper and the HBase Master) and 3 region server nodes (each
running a DataNode and a Region Server).
I also have a Thrift server running on the master.
I have some Hadoop MR jobs running on a separate Hadoop cluster (written in
JRuby) and some other processes that use Thrift as the endpoint to HBase.
All of this runs on EC2.
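
In case it helps, both the JRuby scripts and the Rails apps talk to the
Thrift server through the stock Ruby bindings generated from hbase.thrift,
more or less like this (a simplified sketch, assuming the default rb
namespace in hbase.thrift - the hostname, the timeout and the sanity-check
call are placeholders, not our exact code):

  require 'thrift'
  require 'hbase'   # generated by "thrift --gen rb hbase.thrift", gen-rb on the load path

  # The Thrift gateway runs on the master, default port 9090.
  socket    = Thrift::Socket.new('hbase-master.internal', 9090, 10)  # timeout in seconds
  transport = Thrift::BufferedTransport.new(socket)
  protocol  = Thrift::BinaryProtocol.new(transport)
  client    = Apache::Hadoop::Hbase::Thrift::Hbase::Client.new(protocol)

  transport.open
  puts client.getTableNames.inspect   # simple sanity-check call
  transport.close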

Lately we've been having weird issues with Thrift: after several hours the
Thrift server "hangs" - the scripts that use it to access HBase get
connection timeouts, and the Ruby on Rails apps we run on Heroku, which
also go through Thrift, simply get stuck. Only restarting the Thrift
process brings everything back to normal.

I've tried tweaking everything I could. Increasing the heap size of the
Thrift process (to 4GB) only delayed the hang-ups (from around 4-5 hours
after startup to 9-10 hours) but did not fix the problem. ZooKeeper and the
HBase Master also have 4GB heaps.
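
(For reference, the Thrift heap is set through hbase-env.sh, roughly
export HBASE_THRIFT_OPTS="$HBASE_THRIFT_OPTS -Xmx4096m", assuming bin/hbase
still picks up HBASE_THRIFT_OPTS the way the stock scripts do.)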

The Thrift log files show nothing; the only entries I see are the
connections being established when I brought the Thrift server up (a few
hours before the hang-ups) and then again when I restarted it.

Looking at the different log files, this is what I see around the time the
hang-ups start:

*ZooKeeper log at the time of the hang-ups, looking at the Thrift process
session IDs (0x1352a393d180008 and 0x1352a393d180009):*
2012-01-30 10:51:36,721 WARN org.apache.zookeeper.server.NIOServerCnxn:
EndOfStreamException: Unable to read additional data from client sessionid
*0x1352a393d180008*, likely client has closed socket
2012-01-30 10:51:36,721 INFO org.apache.zookeeper.server.NIOServerCnxn:
Closed socket connection for client /10.217.55.193:53475 which had
sessionid *0x1352a393d180008*
2012-01-30 10:51:36,721 WARN org.apache.zookeeper.server.NIOServerCnxn:
EndOfStreamException: Unable to read additional data from client sessionid
*0x1352a393d180009*, likely client has closed socket
2012-01-30 10:51:36,722 INFO org.apache.zookeeper.server.NIOServerCnxn:
Closed socket connection for client /10.217.55.193:53477 which had
sessionid *0x1352a393d180009*
2012-01-30 10:52:00,001 INFO org.apache.zookeeper.server.ZooKeeperServer:
Expiring session 0x1352a393d18051c, timeout of 90000ms exceeded
2012-01-30 10:52:00,001 INFO
org.apache.zookeeper.server.PrepRequestProcessor: Processed session
termination for sessionid: 0x1352a393d18051c
2012-01-30 10:52:06,040 INFO org.apache.zookeeper.server.NIOServerCnxn:
Accepted socket connection from /10.217.55.193:35937
2012-01-30 10:52:06,043 INFO org.apache.zookeeper.server.NIOServerCnxn:
Client attempting to establish new session at /10.217.55.193:35937
2012-01-30 10:52:06,044 INFO org.apache.zookeeper.server.NIOServerCnxn:
Established session 0x1352a393d18051d with negotiated timeout 90000 for
client /10.217.55.193:35937
2012-01-30 10:52:08,820 INFO org.apache.zookeeper.server.NIOServerCnxn:
Accepted socket connection from /10.217.55.193:35940
2012-01-30 10:52:08,821 INFO org.apache.zookeeper.server.NIOServerCnxn:
Client attempting to establish new session at /10.217.55.193:35940
2012-01-30 10:52:08,823 INFO org.apache.zookeeper.server.NIOServerCnxn:
Established session 0x1352a393d18051e with negotiated timeout 90000 for
client /10.217.55.193:35940
2012-01-30 10:52:28,001 INFO org.apache.zookeeper.server.ZooKeeperServer:
Expiring session 0x1352a393d18051b, timeout of 90000ms exceeded
2012-01-30 10:52:28,001 INFO
org.apache.zookeeper.server.PrepRequestProcessor: Processed session
termination for sessionid: 0x1352a393d18051b
2012-01-30 10:52:50,844 INFO org.apache.zookeeper.server.NIOServerCnxn:
Accepted socket connection from /10.64.165.124:47983
2012-01-30 10:52:50,856 INFO org.apache.zookeeper.server.NIOServerCnxn:
Client attempting to establish new session at /10.64.165.124:47983
2012-01-30 10:52:50,858 INFO org.apache.zookeeper.server.NIOServerCnxn:
Established session 0x1352a393d18051f with negotiated timeout 90000 for
client /10.64.165.124:47983
2012-01-30 10:52:54,243 WARN org.apache.zookeeper.server.NIOServerCnxn:
EndOfStreamException: Unable to read additional data from client sessionid
0x1352a393d18051f, likely client has
closed socket
2012-01-30 10:52:54,244 INFO org.apache.zookeeper.server.NIOServerCnxn:
Closed socket connection for client /10.64.165.124:47983 which had
sessionid 0x1352a393d18051f
2012-01-30 10:52:56,001 INFO org.apache.zookeeper.server.ZooKeeperServer:
Expiring session *0x1352a393d180009*, timeout of 90000ms exceeded
2012-01-30 10:52:56,001 INFO org.apache.zookeeper.server.ZooKeeperServer:
Expiring session *0x1352a393d180008*, timeout of 90000ms exceeded
2012-01-30 10:52:56,001 INFO
org.apache.zookeeper.server.PrepRequestProcessor: Processed session
termination for sessionid: *0x1352a393d180009*
2012-01-30 10:52:56,001 INFO
org.apache.zookeeper.server.PrepRequestProcessor: Processed session
termination for sessionid: *0x1352a393d180008*

*In addition to that, on one of the Region Servers I found this exception
at the time of the hang-up:*
2012-01-30 10:46:23,854 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
8801271291968240625 lease expired
2012-01-30 10:46:23,854 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
4523402662192609713 lease expired
2012-01-30 10:46:23,854 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
-3235593536276390176 lease expired
2012-01-30 10:46:35,034 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
-8329379051383952775 lease expired
2012-01-30 10:51:36,382 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
listener on 60020: readAndProcess threw exception java.io.IOException:
Connection reset by peer. Count of bytes read: 0
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:237)
        at sun.nio.ch.IOUtil.read(IOUtil.java:210)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
        at org.apache.hadoop.hbase.ipc.HBaseServer.channelRead(HBaseServer.java:1359)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:900)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
2012-01-30 10:52:24,016 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
-4511393305838866925 lease expired
2012-01-30 10:52:24,016 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
-5818959718437063034 lease expired
2012-01-30 10:52:24,016 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
-1408921590864341720 lease expired

I would really appreciate some help - I'm kind of losing my mind over
this. The cluster worked perfectly for a long time, and only recently have
we started having these problems.

Thanks a lot!
Galed.
