hbase-user mailing list archives

From Amandeep Khurana <ama...@gmail.com>
Subject Re: Region servers down when inserting with hbase0.20.0 rc
Date Thu, 06 Aug 2009 08:27:17 GMT
Since your servers might be getting starved for i/o, it's a good idea
to check the throughput of the hard drives.

As root, do:
hdparm -t /dev/sda1
That'll check the sequential read throughput of the sda1 drive. Take it
from there.
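
For a point of reference, the output looks something like this (the
numbers below are made up purely for illustration; a SATA disk of this
vintage should manage somewhere in the tens of MB/sec, and anything far
below that points at an i/o problem):

/dev/sda1:
 Timing buffered disk reads:  300 MB in  3.01 seconds =  99.67 MB/sec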

On 8/5/09, Zheng Lv <lvzheng19800619@gmail.com> wrote:
> Hello,
>     I adjusted the option "zookeeper.session.timeout" to 120000, and then
> restarted the HBase cluster and the test program. After running normally
> for 14 hours, one of the datanodes shut down. When I restarted Hadoop and
> HBase and counted the rows of the table 'webpage', I got 6625, while the
> test program log says there should be at least 885000, so far too much
> data has been lost. (The count was done roughly as in the sketch below;
> the end of the datanode log on that server follows it.)
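>
>     For reference, a minimal sketch of such a row count, assuming the
> 0.20 client API (this is equivalent to "count 'webpage'" in the shell,
> and illustrative rather than our exact code):
>
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.client.ResultScanner;
> import org.apache.hadoop.hbase.client.Scan;
>
> public class CountRows {
>   public static void main(String[] args) throws Exception {
>     // Scan the whole table and count one Result per row (slow but simple).
>     HTable table = new HTable(new HBaseConfiguration(), "webpage");
>     ResultScanner scanner = table.getScanner(new Scan());
>     long rows = 0;
>     for (Result r : scanner) rows++;
>     scanner.close();
>     System.out.println("webpage rows: " + rows);
>   }
> }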
>
> 2009-08-06 04:28:32,214 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.33.9:45465, dest: /192.168.33.6:50010, bytes: 1214,
> op: HDFS_WRITE, cliID: DFSClient_1777493426, srvID:
> DS-1028185837-192.168.33.6-50010-1249268609430, blockid:
> blk_-402434507207277902_27468
> 2009-08-06 04:28:32,214 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block
> blk_-402434507207277902_27468 terminating
> 2009-08-06 04:28:32,606 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.33.6:50010, dest: /192.168.33.5:44924, bytes: 446,
> op: HDFS_READ, cliID: DFSClient_-255011821, srvID:
> DS-1028185837-192.168.33.6-50010-1249268609430, blockid:
> blk_-2647720945992878390_27447
> 2009-08-06 04:28:32,612 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /192.168.33.6:50010, dest: /192.168.33.5:44925, bytes: 277022,
> op: HDFS_READ, cliID: DFSClient_-255011821, srvID:
> DS-1028185837-192.168.33.6-50010-1249268609430, blockid:
> blk_-2647720945992878390_27447
> 2009-08-06 04:28:32,770 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_-5186903983646527212_27469 src: /192.168.33.5:44941 dest:
> /192.168.33.6:50010
> 2009-08-06 04:29:35,672 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> blk_1888582734643135148_27447 1 Exception
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for
> channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/192.168.33.6:35418
> remote=/192.168.33.5:50010]
>         at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>         at
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at java.io.DataInputStream.readLong(DataInputStream.java:399)
>         at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:853)
>         at java.lang.Thread.run(Thread.java:619)
>
> 2009-08-06 04:29:35,673 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block
> blk_1888582734643135148_27447 terminating
> 2009-08-06 04:29:35,683 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock
> for block blk_1888582734643135148_27447
> java.io.EOFException: while trying to read 65557 bytes
> 2009-08-06 04:29:35,689 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> blk_1888582734643135148_27447 received exception
> java.io.EOFException: while trying to read 65557 bytes
> 2009-08-06 04:29:35,689 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 192.168.33.6:50010,
> storageID=DS-1028185837-192.168.33.6-50010-1249268609430, infoPort=50075,
> ipcPort=50020):DataXceiver
> java.io.EOFException: while trying to read 65557 bytes
>         at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:264)
>         at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:308)
>         at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:372)
>         at
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:524)
>         at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
>         at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
>         at java.lang.Thread.run(Thread.java:619)
>
>     *************************************
>
>     And the following is part of the test program log.
>
> insertting 880000 webpages need 51920792 ms.
> insertting 881000 webpages need 51972741 ms.
> insertting 882000 webpages need 52024775 ms.
> 09/08/06 04:32:20 WARN zookeeper.ClientCnxn: Exception closing session
> 0x222e91bb6b90002 to sun.nio.ch.SelectionKeyImpl@527809c6
> java.io.IOException: TIMED OUT
>         at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:858)
> 09/08/06 04:32:21 INFO zookeeper.ClientCnxn: Attempting connection to server
> ubuntu3/192.168.33.8:2222
> 09/08/06 04:32:21 INFO zookeeper.ClientCnxn: Priming connection to
> java.nio.channels.SocketChannel[connected local=/192.168.33.7:52496
> remote=ubuntu3/192.168.33.8:2222]
> 09/08/06 04:32:21 INFO zookeeper.ClientCnxn: Server connection successful
> insertting 883000 webpages need 52246380 ms.
> insertting 884000 webpages need 52298370 ms.
> insertting 885000 webpages need 52380479 ms.
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact
> region server Some server, retryOnlyOne=true, index=0, islastrow=true,
> tries=9, numtries=10, i=0, listsize=1, location=address: 192.168.33.5:60020,
> regioninfo: REGION => {NAME =>
> 'webpage,http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696,1249504267420',
> STARTKEY =>
> 'http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696',
> ENDKEY => '', ENCODED => 1607113409, TABLE => {{NAME => 'webpage',
> FAMILIES => [{NAME => 'CF_CONTENT', COMPRESSION => 'NONE', VERSIONS => '2',
> TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
> BLOCKCACHE => 'true'}, {NAME => 'CF_INFORMATION', COMPRESSION => 'NONE',
> VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536',
> IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}},
> region=webpage,http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696,1249504267420
> for region
> webpage,http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696,1249504267420,
> row
> 'http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504668723_885781',
> but failed after 10 attempts.
> Exceptions:
>
>         at
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1041)
>         at
> org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:584)
>         at org.apache.hadoop.hbase.client.HTable.put(HTable.java:450)
>         at hbasetest.HBaseWebpage.insert(HBaseWebpage.java:82)
>         at hbasetest.InsertThread.run(InsertThread.java:26)
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact
> region server Some server, retryOnlyOne=true, index=0, islastrow=true,
> tries=9, numtries=10, i=0, listsize=1, location=address: 192.168.33.5:60020,
> regioninfo: REGION => {NAME =>
> 'webpage,http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696,1249504267420',
> STARTKEY =>
> 'http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696',
> ENDKEY => '', ENCODED => 1607113409, TABLE => {{NAME => 'webpage',
> FAMILIES => [{NAME => 'CF_CONTENT', COMPRESSION => 'NONE', VERSIONS => '2',
> TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
> BLOCKCACHE => 'true'}, {NAME => 'CF_INFORMATION', COMPRESSION => 'NONE',
> VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536',
> IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}},
> region=webpage,http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696,1249504267420
> for region
> webpage,http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696,1249504267420,
> row
> 'http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504754735_885782',
> but failed after 10 attempts.
> Exceptions:
>
>         at
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1041)
>         at
> org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:584)
>         at org.apache.hadoop.hbase.client.HTable.put(HTable.java:450)
>         at hbasetest.HBaseWebpage.insert(HBaseWebpage.java:82)
>         at hbasetest.InsertThread.run(InsertThread.java:26)
> ...
>
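>     For context, the insert path of the test program boils down to the
> pattern below (a minimal sketch against the 0.20 client API; the column
> qualifiers, write buffer size, and key construction are illustrative,
> not our exact code):
>
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class InsertSketch {
>   public static void main(String[] args) throws Exception {
>     HTable table = new HTable(new HBaseConfiguration(), "webpage");
>     table.setAutoFlush(false);                  // buffer puts client-side
>     table.setWriteBufferSize(12 * 1024 * 1024); // illustrative size
>     for (long i = 0; i < 1000000; i++) {
>       // Row key = page URL + insert timestamp + "_" + sequence number,
>       // as in the region names above.
>       String row = "http://news.163.com/page" + i + ".html"
>           + System.currentTimeMillis() + "_" + i;
>       Put put = new Put(Bytes.toBytes(row));
>       put.add(Bytes.toBytes("CF_CONTENT"), Bytes.toBytes("raw"),
>           Bytes.toBytes("<html>...</html>"));
>       put.add(Bytes.toBytes("CF_INFORMATION"), Bytes.toBytes("title"),
>           Bytes.toBytes("a title"));
>       // put() fills the write buffer; when it flushes, failures surface
>       // as the RetriesExhaustedException above.
>       table.put(put);
>     }
>     table.flushCommits();
>   }
> }
>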
>     Any suggestion?
>     Thanks a lot,
>     LvZheng
>
> 2009/8/5 Zheng Lv <lvzheng19800619@gmail.com>
>
>> Hi Stack,
>>     Thank you very much for your explanation.
>>     We just adjusted the value of the property "zookeeper.session.timeout"
>> to 120000, and we are observing the system now.
>>     "Are nodes running on same nodes as hbase?" --Do you mean we should
>> have several servers running exclusively for the zk cluster? I'm afraid
>> we cannot spare that many servers. Any suggestion?
>>     We don't configure zk in zoo.cfg but in hbase-site.xml. Following is
>> the zk-related content of hbase-site.xml.
>>     <property>
>>       <name>hbase.zookeeper.property.clientPort</name>
>>       <value>2222</value>
>>     </property>
>>
>>      <property>
>>       <name>hbase.zookeeper.quorum</name>
>>       <value>ubuntu2,ubuntu3,ubuntu7,ubuntu9,ubuntu6</value>
>>     </property>
>>
>>     <property>
>>       <name>zookeeper.session.timeout</name>
>>       <value>120000</value>
>>     </property>
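>>
>>     In case it helps, with HBase managing the quorum these properties
>> stand in for a zoo.cfg along these lines (the server lines and client
>> port follow from our settings; dataDir and the peer ports 2888/3888 are
>> just the HBase defaults, not something we set ourselves):
>>
>> clientPort=2222
>> dataDir=/tmp/hbase-${user.name}/zookeeper
>> server.0=ubuntu2:2888:3888
>> server.1=ubuntu3:2888:3888
>> server.2=ubuntu7:2888:3888
>> server.3=ubuntu9:2888:3888
>> server.4=ubuntu6:2888:3888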
>>
>>     Thanks a lot,
>>     LvZheng
>>
>>
>


-- 
Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz
