hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Crashing Region Servers
Date Thu, 09 Dec 2010 19:40:04 GMT
Lance,

Both those lines indicate the problem:

 IPC Server handler 13 on 60020 took 182416ms
 Client session timed out, have not heard from server in 182936ms

It's very clear that your region servers are suffering from
stop-the-world garbage collection pauses. This one was stuck in GC for
about 3 minutes, which is well over the 1-minute ZooKeeper session
timeout after which a region server is considered dead. The rest is
just a side effect of that.
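
If you want to confirm it, turn on GC logging on the region servers and
look for long pauses right around those timestamps. A minimal sketch for
hbase-env.sh, assuming a log path you'd pick yourself:

  # log every GC with timestamps so multi-minute pauses are easy to spot
  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
      -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-regionserver.log"

If you aren't already running CMS, flags like -XX:+UseConcMarkSweepGC and
-XX:CMSInitiatingOccupancyFraction=70 usually shorten the worst pauses.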

This subject comes up often on the mailing list; for example:
http://search-hadoop.com/m/t97q027tr7f2

Make sure the machines don't swap, give plenty of RAM to HBase, use LZO
compression, don't under-provision your machines, etc.
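
For example, something along these lines; the numbers are placeholders to
adapt to your hardware, not a recipe:

  # hbase-env.sh: give the region servers more heap than the ~1G they have now
  export HBASE_HEAPSIZE=4000

  # hbase-site.xml: the session timeout that decides when a server is
  # declared dead (about 60s by default)
  <property>
    <name>zookeeper.session.timeout</name>
    <value>60000</value>
  </property>

  # keep the OS from swapping the JVM out
  sysctl -w vm.swappiness=0

  # hbase shell: enable LZO on the 'data' family of 'article'
  # (disable the table first; the LZO native libs must be installed)
  disable 'article'
  alter 'article', {NAME => 'data', COMPRESSION => 'LZO'}
  enable 'article'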

J-D

On Thu, Dec 9, 2010 at 11:27 AM, Lance Riedel <lanceriedel@gmail.com> wrote:
>
> We have a 6 node cluster, 5 with region servers.  2 of the region servers have been stable
> for days, but 3 of them keep crashing. Here are the logs from around the time the crashes occur.
> (btw, we are shoving approximately the twitter firehose into hbase via flume) I'm an hbase
> newbie, but I have been reading. Not sure what else is needed to help debug. When the problem
> occurs, it is concurrent with issues that appear all the way down in the data node.  Also
> interesting to note: all 3 servers seem to fail somewhat differently:
>
> Servers (6):
> 00.hadoop is the master
> 01.hadoop, 03.hadoop are the stable Region Servers (there is no 02.hadoop - a flaky machine that got taken out)
> 04-06.hadoop are Region Servers that crash at different times, with very different logs
>
>
> Master Attributes:
> HBase Version 0.89.20100924+28, r
> Hadoop Version 0.20.2+737, r98c55c28258aa6f42250569bd7fa431ac657bdbd
>
> HBase Root Directory hdfs://00.hadoop.****:54310/hbase
>
> Load average: 103.0 (average number of regions per regionserver; naive computation)
> Regions On FS: 577
>
>
> (This is after recently restarting 04-06)
>
>
> Address   Start Code    Load
> 01.hadoop:60030  1291268910638   requests=223, regions=118, usedHeap=519, maxHeap=987
> 03.hadoop:60030  1291269219610   requests=23, regions=111, usedHeap=862, maxHeap=987
> 04.hadoop:60030  1291910665912   requests=169, regions=82, usedHeap=194, maxHeap=888
> 05.hadoop:60030  1291909584060   requests=232, regions=110, usedHeap=477, maxHeap=888
> 06.hadoop:60030  1291909723787   requests=99, regions=94, usedHeap=394, maxHeap=888
> Total:
> servers: 5
> requests=746, regions=515
>
>
>
> Following are various logs around the time of the failures for 04-06
>
> ************************************************************************************
> 04.hadoop
> ************************************************************************************
> REGION SERVER ->
> 2010-12-09 05:01:58,640 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction
completed on region article,a83858a08f2270d319f75a7b43c756c2453988e7,1291811929868.58f6d9fc80
> c78f3ca490b0280b4f1226. in 0sec
> 2010-12-09 05:04:30,898 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=1.46 MB, free=176.33 MB, max=177.79 MB, blocks=0, accesses=200583, hits=0, hitRatio=0.00%%, evictions=0, evicted=0, evictedPerRun=NaN
> 2010-12-09 05:04:59,612 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter:
Using syncFs -- HDFS-200
> 2010-12-09 05:04:59,618 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/04.hadoop.****,60020,1291858770125/10.100.154.103%3A60020.1291870585253,
entries=5318
> 6, filesize=63776311. New hlog /hbase/.logs/04.hadoop.****,60020,1291858770125/10.100.154.103%3A60020.1291871099582
> 2010-12-09 05:08:02,033 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Found 1
hlogs to remove  out of total 6; oldest outstanding sequenceid is 32891801 from region article
> ,4b2039b791e894dd479b90661ca97087f61645d3,1291813201039.7dff47c63bb5648bb1ee6670c60553a5.
> 2010-12-09 05:08:02,034 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC Server
handler 13 on 60020 took 182416ms appending an edit to hlog; editcount=0
> 2010-12-09 05:08:02,066 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: moving old
hlog file /hbase/.logs/04.hadoop.****,60020,1291858770125/10.100.154.103%3A60020.12918679476
> 65 whose highest sequenceid is 32880227 to /hbase/.oldlogs/10.100.154.103%3A60020.1291867947665
> 2010-12-09 05:08:02,066 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 182936ms for sessionid 0x12ca58c08ff0048, closing socket connection and attempting reconnect
> 2010-12-09 05:08:02,089 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting
region server serverName=04.hadoop.****,60020,1291858770125, load=(requests=24, regions=
> 64, usedHeap=548, maxHeap=888): Unhandled exception
> org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException:
Server REPORT rejected; currently processing 04.hadoop.****,60020,1291858770125 as dead s
> erver
>        at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:217)
>        at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:271)
>        at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:744)
>        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:561)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>
>        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
       at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>        at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
>        at org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
>        at org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.doRun(HRegionServer.java:635)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.access$000(HRegionServer.java:126)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer$1.run(HRegionServer.java:518)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer$1.run(HRegionServer.java:516)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:337)
>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1046)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:516)
>        at java.lang.Thread.run(Thread.java:662)
> 2010-12-09 05:08:02,090 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=64, stores=64, storefiles=136, storefileIndexSize=27, memstoreSize=353, compactionQueueSize=0, usedHeap=549, maxHeap=888, blockCacheSize=1530552, blockCacheFree=184893160, blockCacheCount=0, blockCacheHitRatio=0
> 2010-12-09 05:08:02,090 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
> 2010-12-09 05:08:02,090 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 0 on 60020: exiting
> 2010-12-09 05:08:02,090 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server listener on 60020
>
>
> MASTER HBASE ->
>
> 2010-12-09 05:05:06,072 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META.
region(s) scanned
> 2010-12-09 05:05:40,004 INFO org.apache.hadoop.hbase.master.ServerManager: 04.hadoop.jive,60020,1291858770125
znode expired
> 2010-12-09 05:05:40,005 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=04.hadoop.jive,60020,1291858770125
to dead servers, added shutdown processing operation
> 2010-12-09 05:05:40,005 DEBUG org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo: ProcessServerShutdown of 04.hadoop.jive,60020,1291858770125
> 2010-12-09 05:05:40,005 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Process shutdown of server 04.hadoop.jive,60020,1291858770125: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> 2010-12-09 05:05:40,008 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting 7 hlog(s) in hdfs://00.hadoop.jive:54310/hbase/.logs/04.hadoop.jive,60020,1291858770125
> 2010-12-09 05:05:40,008 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog 1 of 7: hdfs://00.hadoop.jive:54310/hbase/.logs/04.hadoop.jive,60020,1291858770125/10.100.154.103%3A60020.1291867947665, length=63757709
> 2010-12-09 05:05:40,008 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering filehdfs://00.hadoop.jive:54310/hbase/.logs/04.hadoop.jive,60020,1291858770125/10.100.154.103%3A60020.1291867947665
>
>
> NAMENODE ->
>
> 2010-12-09 05:08:02,471 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
blockMap updated: 10.100.159.13:50010 is added to blk_1531008743226086399_251615 size 63757709
> 2010-12-09 05:08:02,473 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54310,
call complete(/hbase/.logs/04.hadoop.jive,60020,1291858770125/10.100.154.103%3A60020.1291867947665,
DFSClient_hb_m_10.194.194.79:60000_1291788452343) from 10.194.194.79:44117: error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException:
No lease on /hbase/.logs/04.hadoop.jive,60020,1291858770125/10.100.154.103%3A60020.1291867947665
File does not exist. [Lease.  Holder: DFSClient_hb_m_10.194.194.79:60000_1291788452343, pendingcreates:
1]
> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/04.hadoop.jive,60020,1291858770125/10.100.154.103%3A60020.1291867947665
File does not exist. [Lease.  Holder: DFSClient_hb_m_10.194.194.79:60000_1291788452343, pendingcreates:
1]
>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1488)
>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1479)
>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1534)
>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1522)
>        at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:610)
>        at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:528)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1319)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1315)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1313)
> 2010-12-09 05:08:04,206 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock:
blockMap updated: 10.99.97.106:50010 is added to blk_-734280257049179934_251614 size 7330
>
> DATANODE ->
>
> 2010-12-09 05:08:02,212 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException
in BlockReceiver.lastNodeRun: java.io.IOException: Broken pipe
>        at sun.nio.ch.FileDispatcher.write0(Native Method)
>        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
>        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100)
>        at sun.nio.ch.IOUtil.write(IOUtil.java:71)
>        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
>        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>        at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.write(DataTransferProtocol.java:133)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.lastDataNodeRun(BlockReceiver.java:840)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:870)
>        at java.lang.Thread.run(Thread.java:662)
> 2010-12-09 05:08:02,213 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError:
exception:
> 2010-12-09 05:08:02,213 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError:
exception: java.io.IOException: Broken pipe
>        at sun.nio.ch.FileDispatcher.write0(Native Method)
>        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
>        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100)
>        at sun.nio.ch.IOUtil.write(IOUtil.java:71)
>        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
>        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>        at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.write(DataTransferProtocol.java:133)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.lastDataNodeRun(BlockReceiver.java:840)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:870)
>        at java.lang.Thread.run(Thread.java:662)
> 2010-12-09 05:08:02,221 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
blk_-8817504198034990390_251613 0 Exception java.io.IOException: Broken pipe
>        at sun.nio.ch.FileDispatcher.write0(Native Method)
>        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
>        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100)
>        at sun.nio.ch.IOUtil.write(IOUtil.java:71)
>        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
>        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>        at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.write(DataTransferProtocol.java:133)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.lastDataNodeRun(BlockReceiver.java:840)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:870)
>        at java.lang.Thread.run(Thread.java:662)
>
> 2010-12-09 05:08:02,221 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
0 for block blk_-8817504198034990390_251613 terminating
> 2010-12-09 05:08:02,224 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_-8817504198034990390_251613 received exception java.io.IOException: Connection reset by
peer
> 2010-12-09 05:08:02,224 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling
block blk_4505306716668305567_250752 file /dist1/data/hadoop-data/current/subdir19/blk_4505306716668305567
for deletion
> 2010-12-09 05:08:02,226 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.100.154.103:50010,
storageID=DS-866555524-10.100.154.103-50010-1291262739506, infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: Connection reset by peer
>        at sun.nio.ch.FileDispatcher.read0(Native Method)
>        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:237)
>        at sun.nio.ch.IOUtil.read(IOUtil.java:210)
>        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>        at java.io.DataInputStream.read(DataInputStream.java:132)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:267)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
> 2010-12-09 05:08:02,226 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted
block blk_1422641301942366074_250732 at file /dist1/data/hadoop-data/current/subdir19/blk_1422641301942366074
> 2010-12-09 05:08:02,227 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted
block blk_4505306716668305567_250752 at file /dist1/data/hadoop-data/current/subdir19/blk_4505306716668305567
>
>
>
> ************************************************************************************
> 05.hadoop
> ************************************************************************************
> REGION SERVER ->
> 2010-12-09 04:29:25,047 DEBUG org.apache.hadoop.hbase.regionserver.Store: Compaction
size of data: 250.5m; Skipped 1 file(s), size: 206896367
> 2010-12-09 04:29:25,047 INFO org.apache.hadoop.hbase.regionserver.Store: Started compaction
of 2 file(s) in data of article,060c4b75067378c045eead9edec2d4fb7d8f3e5e,1291812563166.95e92e97823e11e57850720a63433a62.
 into hdfs://00.hadoop.****:54310/hbase/article/95e92e97823e11e57850720a63433a62/.tmp, sequenceid=33050221
> 2010-12-09 04:29:26,328 INFO org.apache.hadoop.hbase.regionserver.Store: Completed compaction
of 2 file(s) in data of article,060c4b75067378c045eead9edec2d4fb7d8f3e5e,1291812563166.95e92e97823e11e57850720a63433a62.;
new storefile is hdfs://00.hadoop.****:54310/hbase/article/95e92e97823e11e57850720a63433a62/data/2787768957794924068;
store size is 250.5m
> 2010-12-09 04:29:26,328 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction
completed on region article,060c4b75067378c045eead9edec2d4fb7d8f3e5e,1291812563166.95e92e97823e11e57850720a63433a62.
in 1sec
> 2010-12-09 04:29:26,328 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction
on region article,ed0cdc7ae6db73432e2ca47c973d545bee11ddd8,1291813400028.85a8d31d099f409dd36f3bcc4be1eba6.
> 2010-12-09 04:29:26,330 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction
completed on region article,ed0cdc7ae6db73432e2ca47c973d545bee11ddd8,1291813400028.85a8d31d099f409dd36f3bcc4be1eba6.
in 0sec
> 2010-12-09 04:32:33,081 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGIONSERVER_STOP
> 2010-12-09 04:32:33,081 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
> 2010-12-09 04:32:33,089 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
exception  for block blk_2012253533858803341_250248java.io.EOFException
>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>        at java.io.DataInputStream.readLong(DataInputStream.java:399)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2621)
>
> 2010-12-09 04:32:33,089 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional
data from server sessionid 0x12ca58c08ff0049, likely server has closed socket, closing sock
> et connection and attempting reconnect
> 2010-12-09 04:32:33,089 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC Server
handler 5 on 60020 took 75446ms appending an edit to hlog; editcount=50505
> 2010-12-09 04:32:33,090 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 18
on 60020: exiting
> 2010-12-09 04:32:33,139 INFO org.apache.hadoop.hdfs.DFSClient: Error Recovery for block
blk_2012253533858803341_250248 waiting for responder to exit.
> 2010-12-09 04:32:33,139 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 8
on 60020: exiting
> 2010-12-09 04:32:33,139 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler
1 on 60020: exiting
> 2010-12-09 04:32:33,139 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 10
on 60020: exiting
> 2010-12-09 04:32:33,140 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 0 on 60020: exiting
> 2010-12-09 04:32:33,140 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 5 on 60020: exiting
> 2010-12-09 04:32:33,139 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 8 on 60020: exiting
> 2010-12-09 04:32:33,140 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 4 on 60020: exiting
> 2010-12-09 04:32:33,140 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 3 on 60020: exiting
> 2010-12-09 04:32:33,142 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 7 on 60020: exiting
> 2010-12-09 04:32:33,142 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 9 on 60020: exiting
> 2010-12-09 04:32:33,143 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
> 2010-12-09 04:32:33,143 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler
2 on 60020: exiting
> 2010-12-09 04:32:33,143 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder
> 2010-12-09 04:32:33,143 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler
6 on 60020: exiting
> 2010-12-09 04:32:33,144 INFO org.mortbay.log: Stopped SelectChannelConnector@0.0.0.0:60030
>
>
>
>
> NAMENODE ->
> 2010-12-09 04:32:08,408 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.startFile:
failed to create file /hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555
for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, because this
file is already being created by NN_Recovery on 10.100.159.13
> 2010-12-09 04:32:08,408 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 54310,
call append(/hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555,
DFSClient_hb_m_10.194.194.79:60000_1291788452343) from 10.194.194.79:59118: error: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
failed to create file /hba
> se/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555 for DFSClient_hb_m_10.194.194.79:60000_1291788452343
on client 10.194.194.79, because this file is already being created by NN_Recovery on 10.100.159.13
> org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555
for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, because this
file is already being created by NN_Recovery on 10.100.159.13
>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1194)
>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1282)
>        at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:541)
>        at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:528)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1319)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1315)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1313)
> 2010-12-09 04:32:09,411 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.startFile: failed to create file /hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555 for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, because this file is already being created by NN_Recovery on 10.100.159.13
> 2010-12-09 04:32:09,411 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54310, call append(/hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555, DFSClient_hb_m_10.194.194.79:60000_1291788452343) from 10.194.194.79:59118: error: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555 for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, because this file is already being created by NN_Recovery on 10.100.159.13
> org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/05.hadoop.jive,60020,1291862294517/10.100.159.13%3A60020.1291868481555 for DFSClient_hb_m_10.194.194.79:60000_1291788452343 on client 10.194.194.79, because this file is already being created by NN_Recovery on 10.100.159.13
>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1194)
>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1282)
>        at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:541)
>        at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:528)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1319)
>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1315)
>
>
>
> DATANODE ->
>
> 2010-12-09 04:32:09,534 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
blk_2012253533858803341_250248 1 : Thread is interrupted.
> 2010-12-09 04:32:09,534 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.100.159.13:50010,
storageID=DS-145997007-10.100.159.13-50010-1291262743182, infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: Interrupted receiveBlock
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:579)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
> 2010-12-09 04:32:09,534 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
1 for block blk_2012253533858803341_250248 terminating
>
>
> ************************************************************************************
> 06.hadoop
> ************************************************************************************
> REGION SERVER ->
>
> 2010-12-09 06:17:47,215 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:47,430 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:47,549 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:47,557 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:47,563 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:48,619 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:48,619 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:48,619 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:48,619 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:48,619 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:49,703 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:49,703 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:49,703 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:49,705 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
> 2010-12-09 06:17:49,729 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException;
article,8d36892f14dfa6d17d08900743960d0b0973fc96,1291812431639.cce2b650859704eb184df8fb5c5eee6f.
>
> A bunch of these ^
>
> Then a little later:
> ....
>
> 2010-12-09 06:23:23,561 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread:
Compaction requested for region article,993143fca3613014990cc6975cacdd62f4147e03,12918133247
> 53.19413f33fc660536d9d4f315a1283440. because: regionserver60020.cacheFlusher
> 2010-12-09 06:23:27,058 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter:
Using syncFs -- HDFS-200
> 2010-12-09 06:23:27,080 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/06.hadoop.****,60020,1291864620851/10.99.97.106%3A60020.1291875047947,
entries=53315,
>  filesize=63781925. New hlog /hbase/.logs/06.hadoop.****,60020,1291864620851/10.99.97.106%3A60020.1291875807006
> 2010-12-09 06:23:27,080 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Found 1
hlogs to remove  out of total 8; oldest outstanding sequenceid is 33526881 from region article
> ,1221b0013c61fd7446cb822f698d0bcafa487342,1291872610788.bb2aa6ae8c694a096a00e7bb91fc46c3.
> 2010-12-09 06:23:27,080 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: moving old
hlog file /hbase/.logs/06.hadoop.****,60020,1291864620851/10.99.97.106%3A60020.1291872064424
>  whose highest sequenceid is 33506639 to /hbase/.oldlogs/10.99.97.106%3A60020.1291872064424
> 2010-12-09 06:25:39,162 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC Server
handler 12 on 60020 took 119034ms appending an edit to hlog; editcount=3683
> 2010-12-09 06:25:39,162 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
java.io.IOException: Broken pipe
>        at sun.nio.ch.FileDispatcher.write0(Native Method)
>        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
>        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:100)
>        at sun.nio.ch.IOUtil.write(IOUtil.java:71)
>        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
>        at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2509)
> 2010-12-09 06:25:39,163 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
exception  for block blk_1141147913164105086_258331java.io.EOFException
>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>        at java.io.DataInputStream.readLong(DataInputStream.java:399)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2621)
>
> 2010-12-09 06:25:39,163 INFO org.apache.hadoop.hdfs.DFSClient: Error Recovery for block
blk_1141147913164105086_258331 waiting for responder to exit.
> 2010-12-09 06:25:39,164 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor
exception  for block blk_-7076574401277381592_258324java.io.EOFException
>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>        at java.io.DataInputStream.readLong(DataInputStream.java:399)
>        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:120)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2621)
>
> 2010-12-09 06:25:39,164 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block
blk_-7076574401277381592_258324 bad datanode[0] 10.99.97.106:50010
> 2010-12-09 06:25:39,164 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional
data from server sessionid 0x12ca58c08ff004b, likely server has closed socket, closing socket
connection and attempting reconnect
> 2010-12-09 06:25:39,164 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block
blk_-7076574401277381592_258324 in pipeline 10.99.97.106:50010, 10.100.154.103:50010: bad
datanode 10.99.97.106:50010
> 2010-12-09 06:25:39,165 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block
blk_1141147913164105086_258331 bad datanode[0] 10.99.97.106:50010
> 2010-12-09 06:25:39,167 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional
data from server sessionid 0x12ca58c08ff004c, likely server has closed socket, closing socket
connection and attempting reconnect
> 2010-12-09 06:25:39,169 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
/10.100.154.103:50020. Already tried 0 time(s).
> 2010-12-09 06:25:39,169 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block
blk_1141147913164105086_258331 in pipeline 10.99.97.106:50010, 10.100.159.13:50010: bad datanode
10.99.97.106:50010
> 2010-12-09 06:25:39,209 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGIONSERVER_STOP
> 2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
> 2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3
on 60020: exiting
> 2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9
on 60020: exiting
> 2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 24
on 60020: exiting
> 2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler
1 on 60020: exiting
> 2010-12-09 06:25:39,210 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 15
on 60020: exiting
> 2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler
0 on 60020: exiting
> 2010-12-09 06:25:39,214 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler
3 on 60020: exiting
> 2010-12-09 06:25:39,214 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server listener
on 60020
> 2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler
2 on 60020: exiting
> 2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler
7 on 60020: exiting
> 2010-12-09 06:25:39,213 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 21
on 60020: exiting
> 2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler
4 on 60020: exiting
> 2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder
> 2010-12-09 06:25:39,215 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler
5 on 60020: exiting
> 2010-12-09 06:25:39,216 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler
8 on 60020: exiting
> 2010-12-09 06:25:39,216 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler
9 on 60020: exiting
> 2010-12-09 06:25:39,216 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler
6 on 60020: exiting
> 2010-12-09 06:25:39,216 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping
infoServer
> 2010-12-09 06:25:39,223 WARN org.apache.hadoop.hdfs.DFSClient: Failed recovery attempt
#0 from primary datanode 10.100.159.13:50010
>
>
>
>
