hbase-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: Region server goes away
Date Mon, 19 Apr 2010 22:01:11 GMT
Btw, we (ZK) changed this message from a warning to an INFO in 3.3.0.
It's basically saying that the client is trying to create a znode that
already exists, which is actually fine (an expected case from the client
API side of things); it's just that some server-side error logging code
we had was catching this in the net along with the errors we really
wanted to log at WARN level. Unless HBase does not expect this
(creating a node that already exists), it's not really a problem.
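
For reference, a minimal sketch of the usual client-side handling, where creating
a znode that already exists is treated as success. The class and method names here
are illustrative, not taken from HBase's own code:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    public class ZnodeHelper {
        // Create a znode, treating "already exists" as success.
        public static void ensureZnode(ZooKeeper zk, String path, byte[] data)
                throws KeeperException, InterruptedException {
            try {
                zk.create(path, data, Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException e) {
                // Another client created the node first -- expected, not an error.
            }
        }
    }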

Patrick

On 04/19/2010 02:11 PM, Geoff Hendrey wrote:
> As a follow-up to this saga: HBase seems to be healthy at this time, modulo the WARN
> below, which I have not figured out how to ameliorate. I believe that some of the issues with
> HDFS corruption were caused by the large write buffers that I was using in the mapreduce job
> (32,000 was the number of Puts that would be buffered before a commit). I had tried many write
> buffer values on smaller jobs, and had determined 32,000 to be optimal. However, it seems that
> when I scaled up the mapreduce job, the 32K write buffer was just way too high. I scaled it
> way down to 100, and I don't get any errors or HDFS corruption.
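
A minimal sketch of how the client-side write buffer is typically controlled with the
HTable API of that era. The table name and buffer size are illustrative; note that the
HTable write buffer (hbase.client.write.buffer / setWriteBufferSize) is sized in bytes,
not in a count of Puts:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;

    public class BufferedTableFactory {
        // Open a table whose Puts are batched client-side in a modest buffer,
        // so a single flush doesn't hit the region servers with a huge batch.
        public static HTable openBufferedTable() throws IOException {
            HTable table = new HTable(new HBaseConfiguration(), "my_table");
            table.setAutoFlush(false);                 // batch Puts on the client
            table.setWriteBufferSize(2 * 1024 * 1024); // sized in bytes, not rows
            return table;   // caller calls flushCommits()/close() when done
        }
    }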
>
> Finally, sometimes one of my two region servers seems to disappear (running 'status'
> in the HBase shell shows only 1 region server). However, when I restart HBase, the dead region
> server comes back.
>
> Thanks for the advice and pointers.
>
> -geoff
>
>
> -----Original Message-----
> From: Geoff Hendrey
> Sent: Thursday, April 15, 2010 10:26 AM
> To: hbase-user@hadoop.apache.org
> Subject: RE: Region server goes away
>
> After making all the recommended config changes, the only issue I see is this, in the
> ZooKeeper logs. It happens repeatedly. The HBase shell seems to work fine, running it on the same
> machine as ZooKeeper. Any ideas? I reviewed a thread on the mailing list on this topic,
> but it seemed inconclusive:
>
> 2010-04-15 04:14:36,048 WARN org.apache.zookeeper.server.PrepRequestProcessor: Got exception
> when processing sessionid:0x128012c809c0000 type:create cxid:0x4 zxid:0xfffffffffffffffe
> txntype:unknown n/a
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
>          at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:245)
>          at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> Sent: Wednesday, April 14, 2010 8:45 PM
> To: hbase-user@hadoop.apache.org
> Cc: Paul Mahon; Bill Brune; Shaheen Bahauddin; Rohit Nigam
> Subject: Re: Region server goes away
>
> On Wed, Apr 14, 2010 at 8:27 PM, Geoff Hendrey<ghendrey@decarta.com>  wrote:
>> Hi,
>>
>> I have posted previously about issues I was having with HDFS when I
>> was running HBase and HDFS on the same box, both pseudo-clustered. Now I
>> have two very capable servers. I've set up HDFS with a datanode on each box.
>> I've set up the namenode on one box, and ZooKeeper and the HBase master
>> on the other box. Both boxes are region servers. I am using Hadoop
>> 0.20.2 and HBase 0.20.3.
>
> What do you have for replication?  If two datanodes, have you set it to two rather than the
> default of 3?
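
If the cluster really has only two datanodes, the usual change is to drop the HDFS
replication factor to 2 in hdfs-site.xml; a sketch, with the value chosen to match a
two-datanode cluster:

    <!-- hdfs-site.xml -->
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>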
>
>
>>
>> I have set dfs.datanode.socket.write.timeout to 0 in hbase-site.xml.
>>
> This is probably not necessary.
>
>
>> I am running a mapreduce job with about 200 concurrent reducers, each
>> of which writes into HBase, with 32,000 row flush buffers.
>
>
> Why don't you try with just a few reducers first and then build it up?
>   See if that works?
>
>
>> About 40% through the completion of my job, HDFS started showing one of the
>> datanodes was dead (the one *not* on the same machine as the namenode).
>
>
> Do you think it's dead -- what did a thread dump say? -- or was it just that you couldn't
> get into it?  Any errors in the datanode logs complaining about the xceiver count, or perhaps you
> need to up the number of handlers?
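
Both limits live in hdfs-site.xml on the datanodes; a sketch with commonly suggested
(illustrative) values -- note the historical spelling of the xceiver property name:

    <!-- hdfs-site.xml -->
    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>2048</value>
    </property>
    <property>
      <name>dfs.datanode.handler.count</name>
      <value>10</value>
    </property>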
>
>
>
>> I stopped HBase, and magically the datanode came back to life.
>>
>> Any suggestions on how to increase the robustness?
>>
>>
>> I see errors like this in the datanode's log:
>>
>> 2010-04-14 12:54:58,692 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
>> DatanodeRegistration(10.241.6.80:50010,
>> storageID=DS-642079670-10.241.6.80-50010-1271178858027, infoPort=50075, ipcPort=50020):DataXceiver
>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
>> for channel
>
>
> I believe this is harmless.  It's just the DN timing out the socket -- you set the timeout
> to 0 in hbase-site.xml rather than in hdfs-site.xml, where it would have an effect.  See
> HADOOP-3831 for detail.
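
That is, if the timeout really needs to be disabled, it would be set on the HDFS side.
A sketch of the hdfs-site.xml entry (disabling the timeout at all is optional and often
unnecessary):

    <!-- hdfs-site.xml, on the datanodes -->
    <property>
      <name>dfs.datanode.socket.write.timeout</name>
      <value>0</value>
    </property>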
>
>
>>  to be ready for write. ch : java.nio.channels.SocketChannel[connected
>> local=/10.241.6.80:50010 remote=/10.241.6.80:48320]
>>         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>>         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>>         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
>>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:
>>
>>
>> Here I show the output of 'hadoop dfsadmin -report'. The first time it is
>> invoked, all is well. The second time, one datanode is dead. The third time,
>> the dead datanode has come back to life:
>>
>> [hadoop@dt1 ~]$ hadoop dfsadmin -report
>> Configured Capacity: 1277248323584 (1.16 TB)
>> Present Capacity: 1208326105528 (1.1 TB)
>> DFS Remaining: 1056438108160 (983.88 GB)
>> DFS Used: 151887997368 (141.46 GB)
>> DFS Used%: 12.57%
>> Under replicated blocks: 3479
>> Blocks with corrupt replicas: 0
>> Missing blocks: 0
>>
>> -------------------------------------------------
>> Datanodes available: 2 (2 total, 0 dead)
>>
>> Name: 10.241.6.79:50010
>> Decommission Status : Normal
>> Configured Capacity: 643733970944 (599.52 GB)
>> DFS Used: 75694104268 (70.5 GB)
>> Non DFS Used: 35150238004 (32.74 GB)
>> DFS Remaining: 532889628672 (496.29 GB)
>> DFS Used%: 11.76%
>> DFS Remaining%: 82.78%
>> Last contact: Wed Apr 14 11:20:59 PDT 2010
>>
>>
>
> Yeah, my guess as per above is that the reporting client couldn't get on to the datanode
> because handlers were full or the xceiver limit was exceeded.
>
> Let us know how it goes.
> St.Ack
>
>
>> Name: 10.241.6.80:50010
>> Decommission Status : Normal
>> Configured Capacity: 633514352640 (590.01 GB)
>> DFS Used: 76193893100 (70.96 GB)
>> Non DFS Used: 33771980052 (31.45 GB)
>> DFS Remaining: 523548479488 (487.59 GB)
>> DFS Used%: 12.03%
>> DFS Remaining%: 82.64%
>> Last contact: Wed Apr 14 11:14:37 PDT 2010
>>
>>
>> [hadoop@dt1 ~]$ hadoop dfsadmin -report
>> Configured Capacity: 643733970944 (599.52 GB)
>> Present Capacity: 609294929920 (567.45 GB)
>> DFS Remaining: 532876144640 (496.28 GB)
>> DFS Used: 76418785280 (71.17 GB)
>> DFS Used%: 12.54%
>> Under replicated blocks: 3247
>> Blocks with corrupt replicas: 0
>> Missing blocks: 0
>>
>> -------------------------------------------------
>> Datanodes available: 1 (2 total, 1 dead)
>>
>> Name: 10.241.6.79:50010
>> Decommission Status : Normal
>> Configured Capacity: 643733970944 (599.52 GB)
>> DFS Used: 76418785280 (71.17 GB)
>> Non DFS Used: 34439041024 (32.07 GB)
>> DFS Remaining: 532876144640 (496.28 GB)
>> DFS Used%: 11.87%
>> DFS Remaining%: 82.78%
>> Last contact: Wed Apr 14 11:28:38 PDT 2010
>>
>>
>> Name: 10.241.6.80:50010
>> Decommission Status : Normal
>> Configured Capacity: 0 (0 KB)
>> DFS Used: 0 (0 KB)
>> Non DFS Used: 0 (0 KB)
>> DFS Remaining: 0(0 KB)
>> DFS Used%: 100%
>> DFS Remaining%: 0%
>> Last contact: Wed Apr 14 11:14:37 PDT 2010
>>
>>
>> [hadoop@dt1 ~]$ hadoop dfsadmin -report
>> Configured Capacity: 1277248323584 (1.16 TB)
>> Present Capacity: 1210726427080 (1.1 TB)
>> DFS Remaining: 1055440003072 (982.96 GB)
>> DFS Used: 155286424008 (144.62 GB)
>> DFS Used%: 12.83%
>> Under replicated blocks: 3338
>> Blocks with corrupt replicas: 0
>> Missing blocks: 0
>>
>> -------------------------------------------------
>> Datanodes available: 2 (2 total, 0 dead)
>>
>> Name: 10.241.6.79:50010
>> Decommission Status : Normal
>> Configured Capacity: 643733970944 (599.52 GB)
>> DFS Used: 77775145981 (72.43 GB)
>> Non DFS Used: 33086850051 (30.81 GB)
>> DFS Remaining: 532871974912 (496.28 GB)
>> DFS Used%: 12.08%
>> DFS Remaining%: 82.78%
>> Last contact: Wed Apr 14 11:29:44 PDT 2010
>>
>>
>> Name: 10.241.6.80:50010
>> Decommission Status : Normal
>> Configured Capacity: 633514352640 (590.01 GB)
>> DFS Used: 77511278027 (72.19 GB)
>> Non DFS Used: 33435046453 (31.14 GB)
>> DFS Remaining: 522568028160 (486.68 GB)
>> DFS Used%: 12.24%
>> DFS Remaining%: 82.49%
>> Last contact: Wed Apr 14 11:29:44 PDT 2010
>>
>>
>>
>>
