hbase-user mailing list archives

From Tatsuya Kawano <tatsuya6...@gmail.com>
Subject A kernel panic makes a small HBase cluster crash?
Date Sat, 05 Mar 2011 01:28:43 GMT

Hi, 

I got this question on the Hadoop User Group Japan mailing list, and I need some help from
the experts here. It looks like an HDFS issue, maybe "append" related, but I'm not totally
sure yet.

The person who posted the original question is testing the HA features of HBase 0.90.0 on ASF
Hadoop 0.20.2 (with hadoop-core-0.20-append-r1056497.jar).

His test cluster has only 3 nodes. 

Node 1: RS, DN, ZK   plus   HM, NN
Node 2: RS, DN, ZK
Node 3: RS, DN, ZK

dfs.replication = 3


He brought down Node 3 (which was handling Put requests from his test client) by triggering a
kernel panic ("echo c > /proc/sysrq-trigger"). But the Region Servers on Node 1 and Node 2
also went down, with the following message.

---------------------------------------------------------------------
2011-03-01 23:13:13,056 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
server serverName=ap12.secur2,60020,1298987576087, load=(requests=0,
regions=4, usedHeap=218, maxHeap=1998): Replay of HLog required.
Forcing server shutdown
org.apache.hadoop.hbase.DroppedSnapshotException: region:
Object_Speed_Test,
5003017357526424133520110201051038918,1298988549775.1dbc1bf84b48e1145638b3a3bc3ad1cd
---------------------------------------------------------------------

He can easily reproduce this issue on his cluster. 

Looking at the above message, I thought there was something wrong with HDFS, for example that
the RS was reading a corrupted HFile from HDFS.

Then we checked the HDFS NN and DN logs, and it seems the NN was confused and wasn't able to
allocate a block for the write.

---------------------------------------------------------------------
2011-03-01 23:13:13,006 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
ugi=hbase,hadoop        ip=/XX.XX.XX.XX   cmd=create      src=/hbase/
Object_Speed_Test/1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/
1275904589980700621    dst=null        perm=hbase:supergroup:rw-r--r--
2011-03-01 23:13:13,048 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 2 on 9000, call addBlock(/hbase/Object_Speed_Test/
1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621,
DFSClient_hb_rs_ap12.secur2,60020,1298987576087_1298987617433, null)
from XX.XX.XX.XX:55462: error: java.io.IOException: File /hbase/
Object_Speed_Test/1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/
1275904589980700621 could only be replicated to 0 nodes, instead of 1
java.io.IOException: File /hbase/Object_Speed_Test/
1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621 could only
be replicated to 0 nodes, instead of 1
---------------------------------------------------------------------
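
To make it easier to search the full logs for the same error, here is a minimal Python sketch
that pulls the file path and the replica counts out of the "could only be replicated to"
messages. It assumes each log entry sits on a single line in the real NN log file (the
excerpt above is wrapped by the mail client), and the function name is only illustrative.

```python
import re

# Matches the NameNode error quoted above. Assumes one log entry per line;
# the excerpt in this mail is hard-wrapped, the real log file should not be.
REPLICATION_ERROR = re.compile(
    r"File (?P<path>\S+) could only be replicated to "
    r"(?P<actual>\d+) nodes, instead of (?P<expected>\d+)"
)


def find_replication_errors(log_text):
    """Return a list of (path, actual, expected) for each matching line."""
    results = []
    for line in log_text.splitlines():
        match = REPLICATION_ERROR.search(line)
        if match:
            results.append(
                (
                    match.group("path"),
                    int(match.group("actual")),
                    int(match.group("expected")),
                )
            )
    return results


# The error from the excerpt above, joined back into a single line.
sample = (
    "java.io.IOException: File /hbase/Object_Speed_Test/"
    "1dbc1bf84b48e1145638b3a3bc3ad1cd/.tmp/1275904589980700621 "
    "could only be replicated to 0 nodes, instead of 1"
)

for path, actual, expected in find_replication_errors(sample):
    print(path, actual, expected)
```

Running this over the whole NN log would show whether the error first appears right after the
kernel panic or was already occurring earlier.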

It seems the kernel panic on Node 3 put HDFS into a bad state, so the Region Servers couldn't
write to or read from HDFS and had to shut themselves down.

We couldn't find any more clues in the logs, but I pasted them here: 

http://pastebin.com/NYkNS1c1


Since dfs.replication = 3, all Data Nodes were participating in the HLog write at the time
Node 3 got the kernel panic. I think this somehow made the Name Node think those Data Nodes
were all gone. But I couldn't find the root cause of this issue.
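
The reasoning about the write pipeline can be checked with a small Python sketch. It is only
a simplified model: it assumes each block is written to min(replication, number of Data
Nodes) distinct Data Nodes and ignores pipeline ordering and rack awareness. The function
name and node labels are made up for illustration.

```python
from itertools import combinations


def fraction_of_pipelines_with_node(datanodes, replication, failed_node):
    """Fraction of possible replica sets that include the failed node.

    Simplified model (an assumption, not HDFS's actual placement policy):
    every set of min(replication, len(datanodes)) distinct nodes is possible.
    """
    size = min(replication, len(datanodes))
    pipelines = list(combinations(datanodes, size))
    hits = sum(1 for pipeline in pipelines if failed_node in pipeline)
    return hits / len(pipelines)


nodes = ["node1", "node2", "node3"]

# With 3 Data Nodes and dfs.replication = 3, every pipeline includes Node 3.
print(fraction_of_pipelines_with_node(nodes, 3, "node3"))  # 1.0

# With a lower replication factor, some pipelines would avoid Node 3.
print(fraction_of_pipelines_with_node(nodes, 2, "node3"))
```

So on this cluster every open write, including the HLog writes of all three Region Servers,
had Node 3 in its pipeline when the panic happened.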

Also, he checked the network and disk space, and he believes there was no issue with either
while he was testing.

Thanks, 

--
Tatsuya Kawano
Tokyo, Japan

