hbase-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: regionserver disconnection
Date Mon, 30 Nov 2009 23:17:40 GMT
FYI: https://issues.apache.org/jira/browse/ZOOKEEPER-601

Stack/JD/HBaseTeam - have you considered detailing/presenting the Sun JVM GC
limitations you've faced as part of the HBase effort? I find this really
fascinating (4-minute GC pauses even with the "realtime" GC turned on); I doubt
many people realize things can get this bad, and I think it would be a great
service to the community if you could detail some of your findings (say at the
next BAHUG?). You might even tag-team it with someone from the Hadoop MR/HDFS
team; I hear they face similar issues in the NameNode.

Regards,

Patrick
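
(For context on what such a pause looks like from inside the region server:
HBase's Sleeper logs a warning when a chore thread wakes up far later than
scheduled, as in the "We slept 53556ms, ten times longer than scheduled: 3000"
line quoted further down. Below is a minimal, hypothetical Java sketch of that
kind of detection - not the actual HBase code - assuming a 3 second chore
period as in that log line:

    public class OversleepDetector {
        public static void main(String[] args) throws InterruptedException {
            final long periodMs = 3000;            // scheduled chore period
            long start = System.currentTimeMillis();
            Thread.sleep(periodMs);                // a stop-the-world GC pause stretches this
            long sleptMs = System.currentTimeMillis() - start;
            if (sleptMs > 10 * periodMs) {         // same "ten times longer" threshold as the log
                System.out.println("We slept " + sleptMs
                    + "ms, ten times longer than scheduled: " + periodMs);
            }
        }
    }

A pause long enough to trip this check is also long enough to miss ZooKeeper
heartbeats, which is how a GC pause turns into a session timeout.)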

Zhenyu Zhong wrote:
> Tons of thanks to Stack and J-D as well as the Zookeeper fellows.
> 
> Previously we experienced RS session timeouts even when we set
> zookeeper.session.timeout to 10 minutes.
> It turned out that zookeeper has both a lower bound and an upper bound on
> the session timeout.
> They relate to the tickTime: the session timeout can be no less than 2 times
> the tickTime and no greater than 20 times the tickTime. By default the
> tickTime is 2 sec, so by default the maximum timeout is 40 seconds no matter
> how large a value is set for zookeeper.session.timeout.
> 
> For some unknown reason (probably heavy system load, GC pauses, etc.),
> 40 seconds is not long enough for us, so we had to extend that value.
> 
> One way to do that is to raise hbase.zookeeper.property.tickTime. With that
> value set to 4000ms, we haven't seen the RS session TIMEOUT so far.
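> 
> To make the arithmetic concrete, here is a minimal sketch (a hypothetical
> helper, not HBase or ZooKeeper code) of the clamp the server applies, using
> the 2x/20x tickTime bounds described above:
> 
>     public class SessionTimeoutBounds {
>         // ZooKeeper grants a session timeout within [2*tickTime, 20*tickTime],
>         // whatever the client (zookeeper.session.timeout) asks for.
>         static int negotiated(int requestedMs, int tickTimeMs) {
>             return Math.min(Math.max(requestedMs, 2 * tickTimeMs), 20 * tickTimeMs);
>         }
> 
>         public static void main(String[] args) {
>             // default tickTime 2000ms: a 10 minute request is clamped to 40s
>             System.out.println(negotiated(600000, 2000));  // 40000
>             // hbase.zookeeper.property.tickTime = 4000ms raises the cap to 80s
>             System.out.println(negotiated(600000, 4000));  // 80000
>         }
>     }
> 
> So with tickTime at 4000ms the effective ceiling is still only 80 seconds,
> not the 10 minutes requested; raise tickTime further if a longer session is
> really needed (later ZooKeeper releases also expose a maxSessionTimeout
> server setting).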
> 
> Many thanks to Stack, J-D and Zookeeper fellows.
> 
> I will keep you posted if the session timeout comes back again or if
> anything else causes the RS disconnections.
> 
> Best,
> zhenyu
> 
> 
> 
> 
> On Thu, Nov 19, 2009 at 1:37 PM, stack <stack@duboce.net> wrote:
> 
>> From the zk fellas: try running your zk cluster at INFO level rather than
>> at DEBUG.  One supposition is that the sheer amount of logging is causing
>> distress.  They are still looking into your logs...
>>
>> Thanks for your patience Zhenyu.
>> St.Ack
>>
>>
>> On Thu, Nov 19, 2009 at 8:57 AM, Zhenyu Zhong <zhongresearch@gmail.com
>>> wrote:
>>> Yes, I believe so.  And I don't know why the leader gave up.
>>> For this case, the regionserver got TIMEOUT warning message at 2009-11-18
>>> 12:42:01,482
>>>
>>>
>>> Best,
>>> zhenyu
>>>
>>>
>>>
>>>
>>> On Thu, Nov 19, 2009 at 11:43 AM, stack <stack@duboce.net> wrote:
>>>
>>>> So, the time at which we see
>>>>
>>>> 2009-11-18 12:41:56,782 ERROR
>>>> org.apache.zookeeper.server.quorum.FollowerHandler: Unexpected
>> exception
>>>> causing shutdown while sock still open
>>>>
>>>> ... corresponds to the time at which the regionserver loses its session?
>>>>
>>>> St.Ack
>>>>
>>>> On Thu, Nov 19, 2009 at 8:38 AM, Zhenyu Zhong <zhongresearch@gmail.com
>>>>> wrote:
>>>>> Here is the zookeeper leader log. It had an unexpected exception.  I
>>>>> don't know why.
>>>>>
>>>>> http://pastebin.com/m6594ccc
>>>>>
>>>>> During that time, I don't think the system load is high. Memory usage
>>>>> should be normal.
>>>>>
>>>>> Best,
>>>>> zhenyu
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Nov 19, 2009 at 11:09 AM, stack <stack@duboce.net> wrote:
>>>>>
>>>>>> What is in the other zk logs?  Why did the leader go away?  Memory issue
>>>>>> or something?  Why'd it shut down?  Thanks for pastebin'ing this stuff.
>>>>>> St.Ack
>>>>>>
>>>>>> On Thu, Nov 19, 2009 at 7:53 AM, Zhenyu Zhong <
>>> zhongresearch@gmail.com
>>>>>>> wrote:
>>>>>>> After digging more on the zookeeper log, I found that there seemed to
>>>>>>> be a socket connection timeout. Also some warnings indicate that the
>>>>>>> zookeeper server is not running. Probably I need to understand the
>>>>>>> mechanism of zookeeper first.
>>>>>>>
>>>>>>> Please find the zookeeper log around that time.
>>>>>>>
>>>>>>> http://pastebin.com/m37bb4ad1
>>>>>>>
>>>>>>> Many thanks
>>>>>>> zhenyu
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 19, 2009 at 10:31 AM, Zhenyu Zhong <
>>>>> zhongresearch@gmail.com
>>>>>>>> wrote:
>>>>>>>> Stack,
>>>>>>>>
>>>>>>>> I really appreciate it. Let me dig into the zookeeper log more.
>>>>>>>>
>>>>>>>> FYI, I saw the Sleeper complain about the delay for the first time.
>>>>>>>> It looks like there are different reasons for the RS disconnections.
>>>>>>>> 2009-11-19 08:08:32,708 DEBUG org.apache.zookeeper.ClientCnxn:
>>> Got
>>>>> ping
>>>>>>>> response for sessionid:0x42508716bfc0001 after 0ms
>>>>>>>> 2009-11-19 08:08:37,595 DEBUG org.apache.zookeeper.ClientCnxn:
>>> Got
>>>>> ping
>>>>>>>> response for sessionid:0x42508716bfc0082 after 0ms
>>>>>>>> 2009-11-19 08:08:46,073 DEBUG org.apache.zookeeper.ClientCnxn:
>>> Got
>>>>> ping
>>>>>>>> response for sessionid:0x42508716bfc0001 after 0ms
>>>>>>>> 2009-11-19 08:09:39,880 WARN org.apache.zookeeper.ClientCnxn:
>>>>> Exception
>>>>>>>> closing session 0x42508716bfc0001 to
>>>>>> sun.nio.ch.SelectionKeyImpl@4f3a608f
>>>>>>>> java.io.IOException: TIMED OUT
>>>>>>>>         at
>>>>>>>>
>>> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
>>>>>>>> 2009-11-19 08:09:39,880 WARN org.apache.zookeeper.ClientCnxn:
>>>>> Exception
>>>>>>>> closing session 0x42508716bfc0082 to
>>>>>> sun.nio.ch.SelectionKeyImpl@14c8875
>>>>>>>> java.io.IOException: TIMED OUT
>>>>>>>>         at
>>>>>>>>
>>> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
>>>>>>>> 2009-11-19 08:09:39,880 WARN
>>> org.apache.hadoop.hbase.util.Sleeper:
>>>> We
>>>>>>> slept
>>>>>>>> 53556ms, ten times longer than scheduled: 3000
>>>>>>>> 2009-11-19 08:09:39,980 WARN
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
>>>>>>>> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>>>>>>>>         at
>>>>>> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>>>>>>> Method)
>>>>>>>>         at
>>>>>>>>
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>>>>>>>>         at
>>>>>>>>
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>>>>>>>>         at
>>>>>>> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>>>>>>>         at
>>>>>>>>
>> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
>>>>>>>>         at
>>>>>>>>
>> org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
>>>>>>>>         at
>>>>>>>>
>> org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
>>>>>>>>         at
>>>>>>>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
>>>>>>>>         at java.lang.Thread.run(Thread.java:619)
>>>>>>>> 2009-11-19 08:09:39,980 INFO
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Got
>> ZooKeeper
>>>>>> event,
>>>>>>>> state: Disconnected, type: None, path: null
>>>>>>>> 2009-11-19 08:09:40,530 INFO org.apache.zookeeper.ClientCnxn:
>>>>>> Attempting
>>>>>>>> connection to server superpyxis0005/192.168.100.119:2181
>>>>>>>> 2009-11-19 08:09:40,530 INFO org.apache.zookeeper.ClientCnxn:
>>>> Priming
>>>>>>>> connection to java.nio.channels.SocketChannel[connected local=/
>>>>>>>> 192.168.100.122:40659 remote=superpyxis0005/
>> 192.168.100.119:2181
>>> ]
>>>>>>>>
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> zhenyu
>>>>>>>>
>>>>>>>> On Wed, Nov 18, 2009 at 6:52 PM, stack <stack@duboce.net>
>> wrote:
>>>>>>>>> Patrick Hunt, one of the zk lads, answered with the following:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> According to the following snippet (I cut to just session
>>>>>>>>> 0x425035c48210002) the client is getting pings for this session id
>>>>>>>>> regularly every 6 seconds or so, then after 41,016 there's a 20 second
>>>>>>>>> gap (corresponds to a 30 second timeout being set during session
>>>>>>>>> creation) at which point we disconnect from the server.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> 2009-11-18 12:41:34,788 DEBUG
>> org.apache.zookeeper.ClientCnxn:
>>>> Got
>>>>>>> ping
>>>>>>>>>> response for sessionid:0x425035c48210002 after 0ms
>>>>>>>>>> 2009-11-18 12:42:01,482 WARN
>> org.apache.zookeeper.ClientCnxn:
>>>>>>> Exception
>>>>>>>>>> closing session 0x425035c48210002 to
>>>>>>>>> sun.nio.ch.SelectionKeyImpl@421690ab
>>>>>>>>>> java.io.IOException: TIMED OUT
>>>>>>>>> The client notifies hbase of the disconnect, and attempts to
>>>>>>>>> reconnect to a server:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Got
>>>> ZooKeeper
>>>>>>> event,
>>>>>>>>>> state: Disconnected, type: None, path: null
>>>>>>>>>> 2009-11-18 12:42:01,782 INFO
>> org.apache.zookeeper.ClientCnxn:
>>>>>>> Attempting
>>>>>>>>>> connection to server superpyxis0003/192.168.100.117:2181
>>>>>>>>> but the server won't allow the session to be established for some
>>>>>>>>> reason (is that the right server:port that it's trying to connect to?)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> 2009-11-18 12:42:02,182 INFO
>> org.apache.zookeeper.ClientCnxn:
>>>>> Server
>>>>>>>>>> connection successful
>>>>>>>>>> 2009-11-18 12:42:02,182 WARN
>> org.apache.zookeeper.ClientCnxn:
>>>>>>> Exception
>>>>>>>>>> closing session 0x425035c48210002 to
>>>>>>>>> sun.nio.ch.SelectionKeyImpl@5c07076b
>>>>>>>>>> java.io.IOException: Read error rc = -1
>>>>>>> java.nio.DirectByteBuffer[pos=0
>>>>>>>>>> lim=4 cap=4]
>>>>>>>>> The "read error rc = -1" means that the client read from the
>>>> socket
>>>>>> and
>>>>>>>>> got
>>>>>>>>> EOS. But it's not a session expiration afaict (and should not
>> be
>>>> at
>>>>>> this
>>>>>>>>> point in time).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> You need to look at the ZK server logs for this period in time (for
>>>>>>>>> this server superpyxis0003:2181, and superpyxis0001:2181, the client
>>>>>>>>> tries to connect to both of these). There should be some indication
>>>>>>>>> there why the server is closing down the connection and not allowing
>>>>>>>>> the session to be re-established.
>>>>>>>>>
>>>>>>>>> One thought - by default we limit clients to a max of 10 connections
>>>>>>>>> from a particular host to the server. If you exceed this limit the zk
>>>>>>>>> server (it's per server though, not per cluster, so perhaps this is
>>>>>>>>> not it) will not allow any additional connections from the client
>>>>>>>>> host (you can't exceed 10 on a server). Could this be it? i.e. the
>>>>>>>>> user is running a number of ZK clients (say hbase processes) from the
>>>>>>>>> same host? Regardless, get the server log for this time period and it
>>>>>>>>> should be more clear.
>>>>>>>>>
>>>>>>>>> Patrick
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Nov 18, 2009 at 11:18 AM, stack <stack@duboce.net>
>>> wrote:
>>>>>>>>>> And running stat on each of your zookeeper nodes tells you what
>>>>>>>>>> about averages for connect-times?
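>>>>>>>>>>
>>>>>>>>>> (For anyone unfamiliar with "running stat": a ZooKeeper server
>>>>>>>>>> answers the four-letter "stat" command on its client port with
>>>>>>>>>> latency figures and the current connection list. A minimal,
>>>>>>>>>> hypothetical Java sketch - the host and port here are just
>>>>>>>>>> assumptions for this cluster:
>>>>>>>>>>
>>>>>>>>>>     import java.io.BufferedReader;
>>>>>>>>>>     import java.io.InputStreamReader;
>>>>>>>>>>     import java.net.Socket;
>>>>>>>>>>
>>>>>>>>>>     public class ZkStat {
>>>>>>>>>>         public static void main(String[] args) throws Exception {
>>>>>>>>>>             try (Socket s = new Socket("superpyxis0001", 2181)) {
>>>>>>>>>>                 s.getOutputStream().write("stat".getBytes());
>>>>>>>>>>                 s.getOutputStream().flush();
>>>>>>>>>>                 BufferedReader r = new BufferedReader(
>>>>>>>>>>                     new InputStreamReader(s.getInputStream()));
>>>>>>>>>>                 String line;
>>>>>>>>>>                 while ((line = r.readLine()) != null) {
>>>>>>>>>>                     System.out.println(line);  // includes "Latency min/avg/max"
>>>>>>>>>>                 }
>>>>>>>>>>             }
>>>>>>>>>>         }
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>> Averages near zero are healthy; large or growing averages point at
>>>>>>>>>> an overloaded quorum member.)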
>>>>>>>>>>
>>>>>>>>>> It's odd that there is a ping just 20 seconds beforehand and you
>>>>>>>>>> have set the zk session timeout at 60 seconds (or ten minutes?).
>>>>>>>>>> Let me ask the zk lads about it.
>>>>>>>>>>
>>>>>>>>>> If you look in the GC log for around this time, do you see anything?
>>>>>>>>>> St.Ack
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 18, 2009 at 7:32 AM, Zhenyu Zhong <
>>>>>>> zhongresearch@gmail.com
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> After running more experiments, it seems that the disconnection is
>>>>>>>>>>> not related to the heavy-load job: the load average is low, the
>>>>>>>>>>> disk I/O is normal, the memory heap limit was not reached, and the
>>>>>>>>>>> virtual memory stats show no swapping.  However, the disconnection
>>>>>>>>>>> still happens. It looks like this time it paused for 20 seconds.
>>>>>>>>>>> No idea why the regionserver disconnected.
>>>>>>>>>>>
>>>>>>>>>>> Any other suggestions please?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2009-11-18 12:41:00,956 DEBUG
>>> org.apache.zookeeper.ClientCnxn:
>>>>> Got
>>>>>>> ping
>>>>>>>>>>> response for sessionid:0x125035c495d02bb after 0ms
>>>>>>>>>>> 2009-11-18 12:41:08,074 DEBUG
>>> org.apache.zookeeper.ClientCnxn:
>>>>> Got
>>>>>>> ping
>>>>>>>>>>> response for sessionid:0x425035c48210002 after 0ms
>>>>>>>>>>> 2009-11-18 12:41:14,302 DEBUG
>>> org.apache.zookeeper.ClientCnxn:
>>>>> Got
>>>>>>> ping
>>>>>>>>>>> response for sessionid:0x125035c495d02bb after 0ms
>>>>>>>>>>> 2009-11-18 12:41:21,433 DEBUG
>>> org.apache.zookeeper.ClientCnxn:
>>>>> Got
>>>>>>> ping
>>>>>>>>>>> response for sessionid:0x425035c48210002 after 0ms
>>>>>>>>>>> 2009-11-18 12:41:27,656 DEBUG
>>> org.apache.zookeeper.ClientCnxn:
>>>>> Got
>>>>>>> ping
>>>>>>>>>>> response for sessionid:0x125035c495d02bb after 0ms
>>>>>>>>>>> 2009-11-18 12:41:34,788 DEBUG
>>> org.apache.zookeeper.ClientCnxn:
>>>>> Got
>>>>>>> ping
>>>>>>>>>>> response for sessionid:0x425035c48210002 after 0ms
>>>>>>>>>>> 2009-11-18 12:41:41,016 DEBUG
>>> org.apache.zookeeper.ClientCnxn:
>>>>> Got
>>>>>>> ping
>>>>>>>>>>> response for sessionid:0x125035c495d02bb after 1ms
>>>>>>>>>>> 2009-11-18 12:42:01,482 WARN
>> org.apache.zookeeper.ClientCnxn:
>>>>>>> Exception
>>>>>>>>>>> closing session 0x425035c48210002 to
>>>>>>>>> sun.nio.ch.SelectionKeyImpl@421690ab
>>>>>>>>>>> java.io.IOException: TIMED OUT
>>>>>>>>>>>        at
>>>>>>>>>>>
>>>>> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
>>>>>>>>>>> 2009-11-18 12:42:01,582 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Got
>>>> ZooKeeper
>>>>>>>>> event,
>>>>>>>>>>> state: Disconnected, type: None, path: null
>>>>>>>>>>> 2009-11-18 12:42:01,782 INFO
>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>> Attempting
>>>>>>>>>>> connection to server superpyxis0003/192.168.100.117:2181
>>>>>>>>>>> 2009-11-18 12:42:01,782 INFO
>> org.apache.zookeeper.ClientCnxn:
>>>>>> Priming
>>>>>>>>>>> connection to java.nio.channels.SocketChannel[connected
>>> local=/
>>>>>>>>>>> 192.168.100.132:46610 remote=superpyxis0003/
>>>> 192.168.100.117:2181
>>>>> ]
>>>>>>>>>>> 2009-11-18 12:42:01,782 INFO
>> org.apache.zookeeper.ClientCnxn:
>>>>>> Server
>>>>>>>>>>> connection successful
>>>>>>>>>>> 2009-11-18 12:42:01,782 WARN
>> org.apache.zookeeper.ClientCnxn:
>>>>>>> Exception
>>>>>>>>>>> closing session 0x425035c48210002 to
>>>>>>>>> sun.nio.ch.SelectionKeyImpl@56dc6fac
>>>>>>>>>>> java.io.IOException: Read error rc = -1
>>>>>>> java.nio.DirectByteBuffer[pos=0
>>>>>>>>>>> lim=4 cap=4]
>>>>>>>>>>>        at
>>>>>>>>>>>
>> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
>>>>>>>>>>>        at
>>>>>>>>>>>
>>>>> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
>>>>>>>>>>> 2009-11-18 12:42:01,782 WARN
>> org.apache.zookeeper.ClientCnxn:
>>>>>>> Ignoring
>>>>>>>>>>> exception during shutdown input
>>>>>>>>>>> java.net.SocketException: Transport endpoint is not
>> connected
>>>>>>>>>>>         at sun.nio.ch.SocketChannelImpl.shutdown(Native
>>> Method)
>>>>>>>>>>>        at
>>>>>>>>>>>
>>>> sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
>>>>>>>>>>>        at
>>>>>>>>> sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
>>>>>>>>>>>         at
>>>>>>>>>>>
>>>> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
>>>>>>>>>>>        at
>>>>>>>>>>>
>>>>> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
>>>>>>>>>>> 2009-11-18 12:42:01,782 WARN
>> org.apache.zookeeper.ClientCnxn:
>>>>>>> Ignoring
>>>>>>>>>>> exception during shutdown output
>>>>>>>>>>> java.net.SocketException: Transport endpoint is not
>> connected
>>>>>>>>>>>         at sun.nio.ch.SocketChannelImpl.shutdown(Native
>>> Method)
>>>>>>>>>>>        at
>>>>>>>>>>>
>>>> sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:651)
>>>>>>>>>>>        at
>> sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
>>>>>>>>>>>        at
>>>>>>>>>>>
>> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1004)
>>>>>>>>>>>        at
>>>>>>>>>>>
>>>>> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
>>>>>>>>>>> 2009-11-18 12:42:02,182 INFO
>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>> Attempting
>>>>>>>>>>> connection to server superpyxis0001/192.168.100.115:2181
>>>>>>>>>>> 2009-11-18 12:42:02,182 INFO
>> org.apache.zookeeper.ClientCnxn:
>>>>>> Priming
>>>>>>>>>>> connection to java.nio.channels.SocketChannel[connected
>>> local=/
>>>>>>>>>>> 192.168.100.132:36197 remote=superpyxis0001/
>>>> 192.168.100.115:2181
>>>>> ]
>>>>>>>>>>> 2009-11-18 12:42:02,182 INFO
>> org.apache.zookeeper.ClientCnxn:
>>>>>> Server
>>>>>>>>>>> connection successful
>>>>>>>>>>> 2009-11-18 12:42:02,182 WARN
>> org.apache.zookeeper.ClientCnxn:
>>>>>>> Exception
>>>>>>>>>>> closing session 0x425035c48210002 to
>>>>>>>>> sun.nio.ch.SelectionKeyImpl@5c07076b
>>>>>>>>>>> java.io.IOException: Read error rc = -1
>>>>>>> java.nio.DirectByteBuffer[pos=0
>>>>>>>>>>> lim=4 cap=4]
>>>>>>>>>>>        at
>>>>>>>>>>>
>> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
>>>>>>>>>>>        at
>>>>>>>>>>>
>>>>> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
>>>>>>>>>>> 2009-11-18 12:42:02,182 WARN
>> org.apache.zookeeper.ClientCnxn:
>>>>>>> Ignoring
>>>>>>>>>>> exception during shutdown input
>>>>>>>>>>> java.net.SocketException: Transport endpoint is not
>> connected
>>>>>>>>>>>         at sun.nio.ch.SocketChannelImpl.shutdown(Native
>>> Method)
>>>>>>>>>>>        at
>>>>>>>>>>>
>>>> sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
>>>>>>>>>>>        at
>>>>>>>>> sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
>>>>>>>>>>>         at
>>>>>>>>>>>
>>>> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
>>>>>>>>>>>        at
>>>>>>>>>>>
>>>>> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
>>>>>>>>>>> 2009-11-18 12:42:02,182 WARN
>> org.apache.zookeeper.ClientCnxn:
>>>>>>> Ignoring
>>>>>>>>>>> exception during shutdown output
>>>>>>>>>>> java.net.SocketException: Transport endpoint is not
>> connected
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Nov 16, 2009 at 4:16 PM, Zhenyu Zhong <
>>>>>>> zhongresearch@gmail.com
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> Here is the disk I/O and CPU around the time we had the RS
>>>>>>>>>>>> disconnection, on one machine that runs a RegionServer. It doesn't
>>>>>>>>>>>> seem to be high. Similar disk and cpu usage has been seen before
>>>>>>>>>>>> while HBase was running fine.
>>>>>>>>>>>>
>>>>>>>>>>>> So far I haven't found why my 10 minute session timeout doesn't
>>>>>>>>>>>> apply. Still digging.
>>>>>>>>>>>>
>>>>>>>>>>>> Many thanks!
>>>>>>>>>>>>
>>>>>>>>>>>> Time: 03:01:08 PM
>>>>>>>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>>>>>>>            3.82    0.00    0.89    3.97    0.00   91.32
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Device:            tps   Blk_read/s   Blk_wrtn/s
>> Blk_read
>>>>>>>>> Blk_wrtn
>>>>>>>>>>>> sda              24.30      1605.39      6679.35
>> 6238623388
>>>>>>>>> 25956251720
>>>>>>>>>>>> sda1              0.51         1.94        15.71
>>  7547265
>>>>>>>>> 61067904
>>>>>>>>>>>> sda2              0.13         2.11         5.20
>>  8202079
>>>>>>>>> 20189232
>>>>>>>>>>>> sda3              0.13         0.25         1.63
>> 987846
>>>>>>>>>  6323336
>>>>>>>>>>>> sda4              0.00         0.00         0.00
>>  4
>>>>>>>>>  0
>>>>>>>>>>>> sda5              1.78         0.06        24.14
>> 220010
>>>>>>>>> 93817208
>>>>>>>>>>>> sda6              0.38         0.42         8.41
>>  1630584
>>>>>>>>> 32688152
>>>>>>>>>>>> sda7             21.38      1600.61      6624.26
>> 6220035272
>>>>>>>>> 25742165888
>>>>>>>>>>>> sdb               4.52       767.33       565.91
>> 2981868690
>>>>>>>>> 2199132380
>>>>>>>>>>>> sdb1              4.52       767.33       565.91
>> 2981866755
>>>>>>>>> 2199132372
>>>>>>>>>>>> sdc               4.42       742.95       563.00
>> 2887151482
>>>>>>>>> 2187823092
>>>>>>>>>>>> sdc1              4.42       742.95       563.00
>> 2887149547
>>>>>>>>> 2187823084
>>>>>>>>>>>> sdd               4.49       750.78       557.25
>> 2917554074
>>>>>>>>> 2165513500
>>>>>>>>>>>> sdd1              4.49       750.78       557.25
>> 2917552139
>>>>>>>>> 2165513492
>>>>>>>>>>>> sde               4.52       758.51       569.46
>> 2947593394
>>>>>>>>> 2212964236
>>>>>>>>>>>> sde1              4.52       758.51       569.46
>> 2947591459
>>>>>>>>> 2212964228
>>>>>>>>>>>> sdf               4.51       747.22       571.78
>> 2903740266
>>>>>>>>> 2221972652
>>>>>>>>>>>> sdf1              4.51       747.22       571.78
>> 2903738331
>>>>>>>>> 2221972644
>>>>>>>>>>>> Best,
>>>>>>>>>>>> zhenyu
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Nov 16, 2009 at 4:07 PM, stack <stack@duboce.net
>>>>> wrote:
>>>>>>>>>>>>> On Mon, Nov 16, 2009 at 12:05 PM, Zhenyu Zhong <
>>>>>>>>>>> zhongresearch@gmail.com
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> I just realized that there was a MapReduce job running during
>>>>>>>>>>>>>> the time the regionserver disconnected from the zookeeper.
>>>>>>>>>>>>>> That MapReduce job was processing 500GB of data and took about
>>>>>>>>>>>>>> 8 minutes to finish. It launched over 2000 map tasks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> There was a tasktracker running on the RegionServer that
>>>>>>>>>>>>> disconnected?  Is the map i/o or cpu heavy?  Do you think it
>>>>>>>>>>>>> could have stolen life from the datanode/regionserver?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I suspect that this introduced resource contention between the
>>>>>>>>>>>>>> DataNode and the RegionServer node.
>>>>>>>>>>>>>> Also, during the time that MapReduce job ran, I saw a few errors
>>>>>>>>>>>>>> indicating that
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> java.io.IOException: Could not obtain block:
>>>>>>>>>>>>>> blk_7661247556283580226_4450462
>>>>>> file=/data/xxx.txt.200910141159
>>>>>>>>>>>>>>        at
>>>>>>>>>>>>>>
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1787)
>>>>>>>>>>>>>>        at
>>>>>>>>>>>>>>
>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1615)
>>>>>>>>>>>>>>        at
>>>>>>>>>>>>>>
>>> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1742)
>>>>>>>>>>>>>>        at
>>>>>> java.io.DataInputStream.read(DataInputStream.java:83)
>>>>>>>>>>>>>>        at
>>>>> org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
>>>>>>>>>>>>>>        at
>>>>>>>>>>>>>>
>>> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
>>>>>>>>>>>>>>        at
>>>>>>>>>>>>>>
>> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
>>>>>>>>>>>>>>        at
>>>>>>>>>>>>>>
>> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
>>>>>>>>>>>>>>        at
>>>>>>>>>>>>>>
>> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
>>>>>>>>>>>>>>        at
>>>>>>>>> org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>>>>>>>>>>>>>>        at
>>>>> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>>>>>>>>>>>>>>        at
>>>>>> org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>>>>>>>>>>>>>>        at
>>>>> org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>>>>>>>>>>> This is interesting.  Your map task is running into
>>>> hdfs-127.
>>>>>>>  Yeah,
>>>>>>>>>>> you
>>>>>>>>>>>>> should patch your cluster if you want to get rid of
>> these
>>>>>>> (hdfs-127
>>>>>>>>> has
>>>>>>>>>>>>> been
>>>>>>>>>>>>> applied to 0.21 hadoop and will be in the next hadoop
>>>> release
>>>>> on
>>>>>>>>> 0.20.x
>>>>>>>>>>>>> branch, hadoop 0.20.2).
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Possibly it is related to HDFS-127, but I don't see any
>>>>>>>>>>>>>> datanodes going down. Does that affect the Regionserver? Shall
>>>>>>>>>>>>>> we apply the patch?
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Read up on hdfs-127 for an explanation of what's going on.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Now I have started tracking the virtual memory stats to see if
>>>>>>>>>>>>>> the same issue happens tomorrow around the same time period.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Excellent.  Thanks for doing this.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> And I highly suspect that this particular MapReduce job hurts
>>>>>>>>>>>>>> the HBase Regionserver.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> PS. I also used gcviewer to parse the GC log; I only see pauses
>>>>>>>>>>>>>> of around 30 seconds maximum.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Did you figure out why your ten minute session timeout is not
>>>>>>>>>>>>> applying?
>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks a lot.
>>>>>>>>>>>>>> zhenyu
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sun, Nov 15, 2009 at 12:52 AM, Zhenyu Zhong <
>>>>>>>>>>> zhongresearch@gmail.com
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> J-D,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you very much for your comments.
>>>>>>>>>>>>>>> My company blocks the IRC port, so I have trouble connecting to
>>>>>>>>>>>>>>> the IRC channel.
>>>>>>>>>>>>>>> I have been trying to ask IT to open the IRC port for me; it
>>>>>>>>>>>>>>> might take a while.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> zhenyu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sat, Nov 14, 2009 at 2:21 PM, Jean-Daniel Cryans
>> <
>>>>>>>>>>>>> jdcryans@apache.org
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The error you are getting is a disconnection from a zookeeper
>>>>>>>>>>>>>>>> server and is very generic.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ZK-86 is still open and the last comment refers to ZK-111,
>>>>>>>>>>>>>>>> saying that the bug (in unit tests) was probably fixed in
>>>>>>>>>>>>>>>> release 3.0.0 last year.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To figure out the hang you have, you can try to jstack the
>>>>>>>>>>>>>>>> process pid and see exactly what's keeping the RS from
>>>>>>>>>>>>>>>> shutting down.
>>>>>>>>>>>>>>>> Would it be possible for you to drop by the IRC channel? This
>>>>>>>>>>>>>>>> way we can debug this at a much faster pace.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thx!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, Nov 14, 2009 at 9:57 AM, Zhenyu Zhong <
>>>>>>>>>>>>> zhongresearch@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> I found this:
>>>>>>>>>>>>>>>>> http://issues.apache.org/jira/browse/ZOOKEEPER-86
>>>>>>>>>>>>>>>>> It looks like the same error I had. Is it a zookeeper bug?
>>>>>>>>>>>>>>>>> When will HBase take the zookeeper version 3.3.0?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> thanks
>>>>>>>>>>>>>>>>> zhenyu
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sat, Nov 14, 2009 at 11:31 AM, Zhenyu Zhong <
>>>>>>>>>>>>>> zhongresearch@gmail.com
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Now I really have doubts about the zookeeper: from the log I
>>>>>>>>>>>>>>>>>> saw errors like IOException Read Error while the zookeeper
>>>>>>>>>>>>>>>>>> client (regionserver) tried to read. But it got disconnected
>>>>>>>>>>>>>>>>>> status from zookeeper.
>>>>>>>>>>>>>>>>>> I don't see any load on any zookeeper quorum servers. DiskIO
>>>>>>>>>>>>>>>>>> is fine.
>>>>>>>>>>>>>>>>>> Also, when the regionserver decides to exit due to the
>>>>>>>>>>>>>>>>>> disconnected status from zookeeper, sometimes the
>>>>>>>>>>>>>>>>>> regionserver hangs during the exit. We can still see the
>>>>>>>>>>>>>>>>>> HRegionServer process even though we don't see it in the
>>>>>>>>>>>>>>>>>> Master web interface.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I also notice that there are zookeeper.retries and
>>>>>>>>>>>>>>>>>> zookeeper.pauses settings in hbase-default.xml; would that
>>>>>>>>>>>>>>>>>> matter?
>>>>>>>>>>>>>>>>>> thanks
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> You may see the IOException error in the following log when
>>>>>>>>>>>>>>>>>> the Regionserver lost the connection to the zookeeper.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 2009-11-14 15:58:31,993 INFO
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HLog:
>>>>>>>>>>>>>>>>>> Roll
>>>>>>>>>>>>>>>>>>
>> /hbase/.logs/superpyxis0008.scur.colo,60021,1258179101617/hlog.dat.1258179106921,
>>>>>>>>>>>>>>>>>> entries=1062964, calcsize=255014936,
>>>>> filesize=139566093.
>>>>>>> New
>>>>>>>>>>> hlog
>> /hbase/.logs/superpyxis0008.scur.colo,60021,1258179101617/hlog.dat.1258214311993
>>>>>>>>>>>>>>>>>> 2009-11-14 15:58:39,295 DEBUG
>>>>>>>>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>> Got
>>>>>>>>>>>>>> ping
>>>>>>>>>>>>>>>>>> response for sessionid:0x424f0ed01ea00a6 after
>> 0ms
>>>>>>>>>>>>>>>>>> 2009-11-14 15:58:52,648 DEBUG
>>>>>>>>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>> Got
>>>>>>>>>>>>>> ping
>>>>>>>>>>>>>>>>>> response for sessionid:0x424f0ed01ea00a6 after
>> 0ms
>>>>>>>>>>>>>>>>>> 2009-11-14 15:59:06,007 DEBUG
>>>>>>>>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>> Got
>>>>>>>>>>>>>> ping
>>>>>>>>>>>>>>>>>> response for sessionid:0x424f0ed01ea00a6 after
>> 0ms
>>>>>>>>>>>>>>>>>> 2009-11-14 15:59:19,365 DEBUG
>>>>>>>>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>> Got
>>>>>>>>>>>>>> ping
>>>>>>>>>>>>>>>>>> response for sessionid:0x424f0ed01ea00a6 after
>> 0ms
>>>>>>>>>>>>>>>>>> 2009-11-14 15:59:32,867 DEBUG
>>>>>>>>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>> Got
>>>>>>>>>>>>>> ping
>>>>>>>>>>>>>>>>>> response for sessionid:0x424f0ed01ea00a6 after
>>> 186ms
>>>>>>>>>>>>>>>>>> 2009-11-14 15:59:46,070 DEBUG
>>>>>>>>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>> Got
>>>>>>>>>>>>>> ping
>>>>>>>>>>>>>>>>>> response for sessionid:0x424f0ed01ea00a6 after
>> 0ms
>>>>>>>>>>>>>>>>>> 2009-11-14 15:59:59,423 DEBUG
>>>>>>>>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>> Got
>>>>>>>>>>>>>> ping
>>>>>>>>>>>>>>>>>> response for sessionid:0x424f0ed01ea00a6 after
>> 0ms
>>>>>>>>>>>>>>>>>> 2009-11-14 16:00:12,775 DEBUG
>>>>>>>>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>> Got
>>>>>>>>>>>>>> ping
>>>>>>>>>>>>>>>>>> response for sessionid:0x424f0ed01ea00a6 after
>> 0ms
>>>>>>>>>>>>>>>>>> 2009-11-14 16:00:28,141 DEBUG
>>>>>>>>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>> Got
>>>>>>>>>>>>>> ping
>>>>>>>>>>>>>>>>>> response for sessionid:0x424f0ed01ea00a6 after
>>>> 2010ms
>>>>>>>>>>>>>>>>>> 2009-11-14 16:01:10,378 WARN
>>>>>>>>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>>>>> Exception
>>>>>>>>>>>>>>>>>> closing session 0x424f0ed01ea00a6 to
>>>>>>>>>>>>>>>> sun.nio.ch.SelectionKeyImpl@39fbb2d6
>>>>>>>>>>>>>>>>>> java.io.IOException: Read error rc = -1
>>>>>>>>>>>>>> java.nio.DirectByteBuffer[pos=0
>>>>>>>>>>>>>>>>>> lim=4 cap=4]
>>>>>>>>>>>>>>>>>>         at
>>>>>>>>>>>>>>>>>>
>>> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
>>>>>>>>>>>>>>>>>>         at
>>>>>>>>>>>>>>>>>>
>> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
>>>>>>>>>>>>>>>>>> 2009-11-14 16:01:10,478 INFO
>>>>>>>>>>>>>>>>>>
>>> org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>>> Got
>>>>>>>>>>> ZooKeeper
>>>>>>>>>>>>>>>> event,
>>>>>>>>>>>>>>>>>> state: Disconnected, type: None, path: null
>>>>>>>>>>>>>>>>>> 2009-11-14 16:01:11,333 WARN
>>>>>>>>>>>>>>>>>>
>>> org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>>>>>>> Attempt=1
>>>>>>>>>>>>>>>>>>
>>>> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>>>>>>>>>>>>>>>>>>         at
>>>>>> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>>>>>>>>>>>>>>>>> Method)
>>>>>>>>>>>>>>>>>>         at
>>>>>>>>>>>>>>>>>>
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>>>>>>>>>>>>>>>>>>         at
>>>>>>>>>>>>>>>>>>
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>>>>>>>>>>>>>>>>>>         at
>>>>>>> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>>>>>>>>>>>>>>>>>         at
>>>>>>>>>>>>>>>>>>
>> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
>>>>>>>>>>>>>>>>>>         at
>>>>>>>>>>>>>>>>>>
>> org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
>>>>>>>>>>>>>>>>>>         at
>>>>>>>>>>>>>>>>>>
>> org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
>>>>>>>>>>>>>>>>>>         at
>>>>>>>>>>>>>>>>>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
>>>>>>>>>>>>>>>>>>         at java.lang.Thread.run(Thread.java:619)
>>>>>>>>>>>>>>>>>> 2009-11-14 16:01:11,433 INFO
>>>>>>>>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>>>>>>> Attempting
>>>>>>>>>>>>>>>>>> connection to server superpyxis0001/
>>>>>> 192.168.100.115:2181
>>>>>>>>>>>>>>>>>> 2009-11-14 16:01:11,433 INFO
>>>>>>>>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>>>> Priming
>>>>>>>>>>>>>>>>>> connection to
>>>>> java.nio.channels.SocketChannel[connected
>>>>>>>>> local=/
>>>>>>>>>>>>>>>>>> 192.168.100.122:59575 remote=superpyxis0001/
>>>>>>>>>>> 192.168.100.115:2181]
>>>>>>>>>>>>>>>>>> 2009-11-14 16:01:11,433 INFO
>>>>>>>>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>>>> Server
>>>>>>>>>>>>>>>>>> connection successful
>>>>>>>>>>>>>>>>>> 2009-11-14 16:01:11,433 INFO
>>>>>>>>>>>>>>>>>>
>>> org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>>> Got
>>>>>>>>>>> ZooKeeper
>>>>>>>>>>>>>>>> event,
>>>>>>>>>>>>>>>>>> state: Expired, type: None, path: null
>>>>>>>>>>>>>>>>>> 2009-11-14 16:01:11,433 WARN
>>>>>>>>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>>>>> Exception
>>>>>>>>>>>>>>>>>> closing session 0x424f0ed01ea00a6 to
>>>>>>>>>>>>>>>> sun.nio.ch.SelectionKeyImpl@6b93d343
>>>>>>>>>>>>>>>>>> java.io.IOException: Session Expired
>>>>>>>>>>>>>>>>>>         at
>>>>>>>>>>>>>>>>>>
>> org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
>>>>>>>>>>>>>>>>>>         at
>>>>>>>>>>>>>>>>>>
>>> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
>>>>>>>>>>>>>>>>>>         at
>>>>>>>>>>>>>>>>>>
>> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
>>>>>>>>>>>>>>>>>> 2009-11-14 16:01:11,433 ERROR
>>>>>>>>>>>>>>>>>>
>>> org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>>>>>>> ZooKeeper
>>>>>>>>>>>>> session
>>>>>>>>>>>>>>>>>> expired
>>>>>>>>>>>>>>>>>> 2009-11-14 16:01:13,513 INFO
>>>>>>>>> org.apache.hadoop.ipc.HBaseServer:
>>>>>>>>>>>>> IPC
>>>>>>>>>>>>>>>> Server
>>>>>>>>>>>>>>>>>> handler 3 on 60021, call put([B@73deb204,
>>>>>>>>>>>>>>>>>> [Lorg.apache.hadoop.hbase.client.Put;@2179600a)
>>>> from
>>>>>>>>>>>>>>>> 192.168.100.132:40728:
>>>>>>>>>>>>>>>>>> error: java.io.IOException: Server not running,
>>>>> aborting
>>>>>>>>>>>>>>>>>> java.io.IOException: Server not running,
>> aborting
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, Nov 14, 2009 at 1:01 AM, Zhenyu Zhong <
>>>>>>>>>>>>>> zhongresearch@gmail.com
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> Stack,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks a lot!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I found out the reason why HBase doesn't take the system
>>>>>>>>>>>>>>>>>>> file descriptor value I set before. I started HBase as root
>>>>>>>>>>>>>>>>>>> instead of the normal hadoop user, and my system configures
>>>>>>>>>>>>>>>>>>> hadoop with a higher file descriptor value but leaves root
>>>>>>>>>>>>>>>>>>> at the default of 1024.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Now my system has a clean start. However, it seems that the
>>>>>>>>>>>>>>>>>>> zookeeper.session.timeout value doesn't take effect. I
>>>>>>>>>>>>>>>>>>> still found around 60 second pauses from the disconnected
>>>>>>>>>>>>>>>>>>> regionserver. I really don't know why the regionserver only
>>>>>>>>>>>>>>>>>>> times out after 60 seconds instead of the 10 minutes I set
>>>>>>>>>>>>>>>>>>> for zookeeper.session.timeout.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Is there any other timeout value coming into play before
>>>>>>>>>>>>>>>>>>> the actual session times out?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> zhenyu
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Nov 13, 2009 at 7:08 PM, stack <
>>>>>> stack@duboce.net
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> Ok.  Lack of file descriptors manifests in all kinds of
>>>>>>>>>>>>>>>>>>>> weird ways.
>>>>>>>>>>>>>>>>>>>> Hopefully that's it.  If not, let's keep digging.
>>>>>>>>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Nov 13, 2009 at 3:44 PM, Zhenyu Zhong
>> <
>>>>>>>>>>>>>>>> zhongresearch@gmail.com
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> Stack,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> You are right, the master started with ulimit -n 1024. It
>>>>>>>>>>>>>>>>>>>>> doesn't take the system value.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Regarding the too many open files error, it looks the
>>>>>>>>>>>>>>>>>>>>> same as the one J-D put up. But I will start the Master
>>>>>>>>>>>>>>>>>>>>> with a higher value first and see if this kind of error
>>>>>>>>>>>>>>>>>>>>> goes away.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> thanks a lot!
>>>>>>>>>>>>>>>>>>>>> zhenyu
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 13, 2009 at 6:02 PM, stack <
>>>>>>>>> stack@duboce.net>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> Does it say
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ulimit -n 32768
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ...as the second line in your log file on start of the
>>>>>>>>>>>>>>>>>>>>>> master?
>>>>>>>>>>>>>>>>>>>>>> Is the stack trace that complains about too many open
>>>>>>>>>>>>>>>>>>>>>> files the same as the one in the blog post J-D put up?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 13, 2009 at 1:37 PM, Zhenyu
>>> Zhong
>>>> <
>>>>>>>>>>>>>>>>>>>> zhongresearch@gmail.com
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> The system file descriptor limit (fs.file-max) was set
>>>>>>>>>>>>>>>>>>>>>>> to 1578334; also, in limits.conf the value was set to
>>>>>>>>>>>>>>>>>>>>>>> 32768.
>>>>>>>>>>>>>>>>>>>>>>> So these are way higher than the open descriptors for
>>>>>>>>>>>>>>>>>>>>>>> the running processes.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> thanks
>>>>>>>>>>>>>>>>>>>>>>> zhenyu
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 13, 2009 at 4:33 PM, Stack <
>>>>>>>>>>>>> saint.ack@gmail.com>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> You upped the ulimit file descriptors as per the
>>>>>>>>>>>>>>>>>>>>>>>> getting started doc?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Nov 13, 2009, at 1:26 PM, Zhenyu
>>> Zhong
>>>> <
>>>>>>>>>>>>>>>>>>>> zhongresearch@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>  Thanks a lot.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Bad news is my kernel is still
>> 2.6.26.
>>>>>>>>>>>>>>>>>>>>>>>>> But it was not a problem before.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Very strange.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> zhenyu
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 13, 2009 at 4:16 PM,
>>>>> Jean-Daniel
>>>>>>>>> Cryans
>>>>>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>> jdcryans@apache.org
>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>  Looks like
>>>>>>>>>>>>>>>>>>>>>>>>>>
>> http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/
>>>>>>>>>>>>>>>>>>>>>>>>>> J-D
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Fri, Nov 13, 2009 at 1:12 PM,
>>> Zhenyu
>>>>>> Zhong
>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>> zhongresearch@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> After I re-organize the cluster,
>> the
>>>>>>> experiment
>>>>>>>>>>> ran
>>>>>>>>>>>>> into
>>>>>>>>>>>>>>>>>>>> problem
>>>>>>>>>>>>>>>>>>>>>>> faster
>>>>>>>>>>>>>>>>>>>>>>>>>> than
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> before.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Basically the changes are to use
>>>> machines
>>>>>>> with
>>>>>>>>>>> less
>>>>>>>>>>>>>>>> resources
>>>>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>>>>>>>>>>> zookeeper
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> quorums and machines with more
>>>> resources
>>>>> as
>>>>>>>>>>>>>> regionserver.
>>>>>>>>>>>>>>>>>>>>>>>>>>> From the log, I still see the pause
>>>>> around
>>>>>> 1
>>>>>>>>>>> minute.
>>>>>>>>>>>>>>>>>>>>>>>>>>> I enabled the GC logging,  please
>> see
>>>>>>>>>>>>>>>>>>>>>>>>>>> http://pastebin.com/m1d4ce0f1
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> for details.
>>>>>>>>>>>>>>>>>>>>>>>>>>> In general I don't see many pauses
>> in
>>>> the
>>>>>> GC.
>>>>>>>>>>>>>>>>>>>>>>>>>>> What is more interesting, 1 hour
>>> after
>>>>> the
>>>>>>> 1st
>>>>>>>>>>>>>>>> regionserver
>>>>>>>>>>>>>>>>>>>>>>>>>>> disconnected,
>>>>>>>>>>>>>>>>>>>>>>>>>>> the master log complained about too
>>>> many
>>>>>> open
>>>>>>>>>>> files.
>>>>>>>>>>>>>> This
>>>>>>>>>>>>>>>>>>>> didn't
>>>>>>>>>>>>>>>>>>>>>>> happen
>>>>>>>>>>>>>>>>>>>>>>>>>>> before.
>>>>>>>>>>>>>>>>>>>>>>>>>>> I checked the system OS setup as
>> well
>>>> as
>>>>>> the
>>>>>>>>>>>>>> limits.conf.
>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>> also
>>>>>>>>>>>>>>>>>>>>>>> checked
>>>>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> running processes. The total open
>>> files
>>>>>> don't
>>>>>>>>>>> reach
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> limit.
>>>>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>>> am
>>>>>>>>>>>>>>>>>>>>>>>>>> confused
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> a bit.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Please see the following master
>> log.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:02,114 INFO
>>> org.apache.hadoop.hbase.master.BaseScanner:
>>>>>>>>>>>>>>>>>>>>>>>>>>> RegionManager.metaScanner scan of
>>> 4658
>>>>>> row(s)
>>>>>>>>> of
>>>>>>>>>>> meta
>>>>>>>>>>>>>>>> region
>>>>>>>>>>>>>>>>>>>>>> {server:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 192.168.100.128:60021, regionname:
>>>>>>> .META.,,1,
>>>>>>>>>>>>> startKey:
>>>>>>>>>>>>>>>> <>}
>>>>>>>>>>>>>>>>>>>>>> complete
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:02,114 INFO
>>> org.apache.hadoop.hbase.master.BaseScanner:
>>>>>>>>>>>>>>>>>>>>>>>>>> All
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> 1 .META. region(s) scanned
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:07,677 DEBUG
>>>>>>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>>>>>>>>>>> Got
>>>>>>>>>>>>>>>>>>>>>>> ping
>>>>>>>>>>>>>>>>>>>>>>>>>>> response for
>>>> sessionid:0x424eebf1c10004c
>>>>>>> after
>>>>>>>>> 3ms
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:08,178 INFO
>>>>>>>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient:
>>>>>>>>>>>>>>>>>>>>>>> Exception
>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> createBlockOutputStream
>>>>>> java.io.IOException:
>>>>>>>>> Bad
>>>>>>>>>>>>> connect
>>>>>>>>>>>>>>>> ack
>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>>>>> firstBadLink 192.168.100.123:50010
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:08,178 INFO
>>>>>>>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient:
>>>>>>>>>>>>>>>>>>>>>>>>>>> Abandoning
>>>>>>>>>>>>>>>>>>>>>>>>>>> block
>>> blk_-2808245019291145247_5478039
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:09,682 INFO
>>>>>>>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient:
>>>>>>>>>>>>>>>>>>>>>>> Exception
>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> createBlockOutputStream
>>>>>> java.io.EOFException
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:09,682 INFO
>>>>>>>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient:
>>>>>>>>>>>>>>>>>>>>>>>>>>> Abandoning
>>>>>>>>>>>>>>>>>>>>>>>>>>> block
>> blk_1074853606841896259_5478048
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:10,334 DEBUG
>>>>>>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>>>>>>>>>>> Got
>>>>>>>>>>>>>>>>>>>>>>> ping
>>>>>>>>>>>>>>>>>>>>>>>>>>> response for
>>> sessionid:0x24eebf1043003c
>>>>>> after
>>>>>>>>> 1ms
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:21,018 DEBUG
>>>>>>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>>>>>>>>>>> Got
>>>>>>>>>>>>>>>>>>>>>>> ping
>>>>>>>>>>>>>>>>>>>>>>>>>>> response for
>>>> sessionid:0x424eebf1c10004c
>>>>>>> after
>>>>>>>>> 0ms
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:23,674 DEBUG
>>>>>>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>>>>>>>>>>> Got
>>>>>>>>>>>>>>>>>>>>>>> ping
>>>>>>>>>>>>>>>>>>>>>>>>>>> response for
>>> sessionid:0x24eebf1043003c
>>>>>> after
>>>>>>>>> 0ms
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:24,828 INFO
>>>>>>>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient:
>>>>>>>>>>>>>>>>>>>>>>> Exception
>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> createBlockOutputStream
>>>>>> java.io.IOException:
>>>>>>>>> Bad
>>>>>>>>>>>>> connect
>>>>>>>>>>>>>>>> ack
>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>>>>>> firstBadLink 192.168.100.123:50010
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:24,828 INFO
>>>>>>>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient:
>>>>>>>>>>>>>>>>>>>>>>>>>>> Abandoning
>>>>>>>>>>>>>>>>>>>>>>>>>>> block
>>> blk_-6642544517082142289_5478063
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:24,828 INFO
>>>>>>>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient:
>>>>>>>>>>>>>>>>>>>>>>> Exception
>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> createBlockOutputStream
>>>>>>>>> java.net.SocketException:
>>>>>>>>>>> Too
>>>>>>>>>>>>>> many
>>>>>>>>>>>>>>>>>>>> open
>>>>>>>>>>>>>>>>>>>>>> files
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:24,828 INFO
>>>>>>>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient:
>>>>>>>>>>>>>>>>>>>>>>>>>>> Abandoning
>>>>>>>>>>>>>>>>>>>>>>>>>>> block
>> blk_2057511041109796090_5478063
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:24,928 INFO
>>>>>>>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient:
>>>>>>>>>>>>>>>>>>>>>>> Exception
>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> createBlockOutputStream
>>>>>>>>> java.net.SocketException:
>>>>>>>>>>> Too
>>>>>>>>>>>>>> many
>>>>>>>>>>>>>>>>>>>> open
>>>>>>>>>>>>>>>>>>>>>> files
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:24,928 INFO
>>>>>>>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient:
>>>>>>>>>>>>>>>>>>>>>>>>>>> Abandoning
>>>>>>>>>>>>>>>>>>>>>>>>>>> block
>> blk_8219260302213892894_5478064
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:30,855 INFO
>>>>>>>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient:
>>>>>>>>>>>>>>>>>>>>>>> Exception
>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> createBlockOutputStream
>>>>>>>>> java.net.SocketException:
>>>>>>>>>>> Too
>>>>>>>>>>>>>> many
>>>>>>>>>>>>>>>>>>>> open
>>>>>>>>>>>>>>>>>>>>>> files
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:30,855 INFO
>>>>>>>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient:
>>>>>>>>>>>>>>>>>>>>>>>>>>> Abandoning
>>>>>>>>>>>>>>>>>>>>>>>>>>> block
>> blk_1669205542853067709_5478235
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:30,905 INFO
>>>>>>>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient:
>>>>>>>>>>>>>>>>>>>>>>> Exception
>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> createBlockOutputStream
>>>>>>>>> java.net.SocketException:
>>>>>>>>>>> Too
>>>>>>>>>>>>>> many
>>>>>>>>>>>>>>>>>>>> open
>>>>>>>>>>>>>>>>>>>>>> files
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:30,905 INFO
>>>>>>>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient:
>>>>>>>>>>>>>>>>>>>>>>>>>>> Abandoning
>>>>>>>>>>>>>>>>>>>>>>>>>>> block
>> blk_9128897691346270351_5478237
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:30,955 INFO
>>>>>>>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient:
>>>>>>>>>>>>>>>>>>>>>>> Exception
>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> createBlockOutputStream
>>>>>>>>> java.net.SocketException:
>>>>>>>>>>> Too
>>>>>>>>>>>>>> many
>>>>>>>>>>>>>>>>>>>> open
>>>>>>>>>>>>>>>>>>>>>> files
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:30,955 INFO
>>>>>>>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient:
>>>>>>>>>>>>>>>>>>>>>>>>>>> Abandoning
>>>>>>>>>>>>>>>>>>>>>>>>>>> block
>> blk_1116845144864123018_5478240
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:34,372 DEBUG
>>>>>>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>>>>>>>>>>> Got
>>>>>>>>>>>>>>>>>>>>>>> ping
>>>>>>>>>>>>>>>>>>>>>>>>>>> response for
>>>> sessionid:0x424eebf1c10004c
>>>>>>> after
>>>>>>>>> 0ms
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:37,034 DEBUG
>>>>>>>>>>>>>>>> org.apache.zookeeper.ClientCnxn:
>>>>>>>>>>>>>>>>>>>> Got
>>>>>>>>>>>>>>>>>>>>>>> ping
>>>>>>>>>>>>>>>>>>>>>>>>>>> response for
>>> sessionid:0x24eebf1043003c
>>>>>> after
>>>>>>>>> 0ms
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:37,235 WARN
>>>>>>>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient:
>>>>>>>>>>>>>>>>>>>>>>>>>> DataStreamer
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Exception: java.io.IOException: Too
>>>> many
>>>>>> open
>>>>>>>>>>> files
>>>>>>>>>>>>>>>>>>>>>>>>>>>      at
>>>> sun.nio.ch.IOUtil.initPipe(Native
>>>>>>>>> Method)
>>>>>>>>>>>>>>>>>>>>>>>>>>>      at
>>>> sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
>>>>>>>>>>>>>>>>>>>>>>>>>>>      at
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
>>>>>>>>>>>>>>>>>>>>>>>>>>>      at
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407)
>>>>>>>>>>>>>>>>>>>>>>>>>>>      at
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322)
>>>>>>>>>>>>>>>>>>>>>>>>>>>      at
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>>>>>>>>>>>>>>>>>>>>>>>>>>>      at
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>>>>>>>>>>>>>>>>>>>>>>>>>>>      at
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>>>>>>>>>>>>>>>>>>>>>>>>>>>      at
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>> java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>>>>>>>>>>>>>>>>>>>>>>>>>>>      at
>>>>>> java.io.DataOutputStream.write(DataOutputStream.java:90)
>>>>>>>>>>>>>>>>>>>>>>>>>>>      at
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2290)
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:37,235 WARN
>>>>>>>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient:
>>>>>>>>>>>>>>>>>>>>> Error
>>>>>>>>>>>>>>>>>>>>>>>>>>> Recovery for block
>>>>>>>>> blk_8148813491785406356_5478475
>>>>>>>>>>>>> bad
>>>>>>>>>>>>>>>>>>>> datanode[0]
>>>>>>>>>>>>>>>>>>>>>>>>>>> 192.168.100.123:50010
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:37,235 WARN
>>>>>>>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient:
>>>>>>>>>>>>>>>>>>>>> Error
>>>>>>>>>>>>>>>>>>>>>>>>>>> Recovery for block
>>>>>>>>> blk_8148813491785406356_5478475
>>>>>>>>>>> in
>>>>>>>>>>>>>>>> pipeline
>>>>>>>>>>>>>>>>>>>>>>>>>>> 192.168.100.123:50010,
>>>>>> 192.168.100.134:50010
>>>>>>> ,
>>>>>>>>>>>>>>>>>>>>> 192.168.100.122:50010
>>>>>>>>>>>>>>>>>>>>>> :
>>>>>>>>>>>>>>>>>>>>>>>>>>> bad
>>>>>>>>>>>>>>>>>>>>>>>>>>> datanode 192.168.100.123:50010
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:37,436 INFO
>>>>>>>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient:
>>>>>>>>>>>>>>>>>>>>>>> Exception
>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> createBlockOutputStream
>>>>>>>>> java.net.SocketException:
>>>>>>>>>>> Too
>>>>>>>>>>>>>> many
>>>>>>>>>>>>>>>>>>>> open
>>>>>>>>>>>>>>>>>>>>>> files
>>>>>>>>>>>>>>>>>>>>>>>>>>> 2009-11-13 20:06:37,436 INFO
>>>>>>>>>>>>>>>> org.apache.hadoop.hdfs.DFSClient:
>>>>>>>>>>>>>>>>>>>>>>>>>>> Abandoning
>>>>>>>>>>>>>>>>>>>>>>>>>>> block
>> blk_2119727700857186236_5478498
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 12, 2009 at 4:21 PM,
>>> Zhenyu
>>>>>> Zhong
>>>>>>> <
>>>>>>>>>>>>>>>>>>>>>>> zhongresearch@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>  Will do.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> thanks
>>>>>>>>>>>>>>>>>>>>>>>>>>>> zhenyu
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 12, 2009 at 3:33 PM,
>>> stack
>>>> <
>>>>>>>>>>>>>> stack@duboce.net
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>  Enable GC logging too on this
>> next
>>>> run
>>>>>> (see
>>>>>>>>>>>>>>>> hbase-env.sh).
>>>>>>>>>>>>>>>>>>>>  Lets
>>>>>>>>>>>>>>>>>>>>>> try
>>>>>>>>>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>>>> get
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> to the bottom of whats going on.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Nov 12, 2009 at 12:29 PM,
>>>>> Zhenyu
>>>>>>>>> Zhong <
>>>>>>>>>>>>>>>>>>>>>>>>>>>> zhongresearch@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>  I can switch the boxes that run
>>>>>> zookeeper
>>>>>>>>> with
>>>>>>>>>>> the
>>>>>>>>>>>>>> ones
>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>>> run
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> regionservers.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I will see the results later.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> FYI. The node does have the 10
>>>> minutes
>>>>>>>>>>>>>>>>>>>>> zookeeper.session.timeout
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> value
>>>>>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> place.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> thanks
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> zhenyu
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> On Thu, Nov 12, 2009 at 3:21 PM, stack <stack@duboce.net> wrote:
>
> > On Thu, Nov 12, 2009 at 11:50 AM, Zhenyu Zhong <zhongresearch@gmail.com> wrote:
> >
> > In my cluster, half of the cluster have 2 disks (400GB each) per
> > machine, and half of the cluster have 6 disks per machine.  Maybe we
> > should run zookeeper on the machines with 2 disks and RS on machines
> > with 6 disks?
>
> That would make most sense, only in the below it looks like the RS that
> had the issue had 4 disks?
>
> > BTW, the 10 minutes zookeeper.session.timeout has been set during the
> > experiment.
>
> And for sure this node had it in place?
> St.Ack
>
> > thanks
> > zhenyu
> On Thu, Nov 12, 2009 at 2:08 PM, stack <stack@duboce.net> wrote:
>
> > On Thu, Nov 12, 2009 at 8:40 AM, Zhenyu Zhong <zhongresearch@gmail.com> wrote:
> >
> > Though I experienced 2 regionserver disconnections this morning, it
> > looks better from the regionserver log (please see the following log):
> > http://pastebin.com/m496dbfae
> > I checked diskIO for one of the regionservers (192.168.100.116)
> > during the time it disconnected:
> >
> > Time: 03:04:58 AM
> > Device:    tps     Blk_read/s  Blk_wrtn/s  Blk_read    Blk_wrtn
> > sda        105.31  5458.83     19503.64    9043873239  32312473676
> > sda1       2.90    6.64        99.25       10993934    164433464
> > sda2       1.72    23.76       30.25       39365817    50124008
> > sda3       0.30    0.38        3.58        624930      5923000
> > sda4       100.39  5428.06     19370.56    8992888270  32091993204
>
> Is this high for you?  20k blocks/second would seem to be high, but it's
> one disk only and it's not being shared by zk anymore so it shouldn't
> matter?
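(A note on reading the table above: the columns have the shape of plain
iostat device output from the sysstat package, e.g.

    # second and later reports show activity over each 5-second window
    iostat -d 5

and on 2.6-era kernels a "block" is a 512-byte sector, so roughly 19,500
Blk_wrtn/s works out to about 10 MB/s of writes.  The command line is an
assumption about how the figures were gathered, not something stated in
the thread.)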
> > I also checked the zookeeper quorum server that the regionserver
> > tried to connect to according to the log.  However, I don't see
> > 192.168.100.116 in the client list of that zookeeper quorum server.
> > Would that be a problem?
>
> Is that because the ephemeral node for the regionserver had evaporated?
> Lost its lease w/ zk by the time you went to look?
> > Thu Nov 12 15:04:52 UTC 2009
> > Zookeeper version: 3.2.1-808558, built on 08/27/2009 18:48 GMT
> > Clients:
> >  /192.168.100.127:43045[1](queued=0,recved=26,sent=0)
> >  /192.168.100.131:39091[1](queued=0,recved=964,sent=0)
> >  /192.168.100.124:35961[1](queued=0,recved=3266,sent=0)
> >  /192.168.100.123:47935[1](queued=0,recved=233,sent=0)
> >  /192.168.100.125:46931[1](queued=0,recved=2,sent=0)
> >  /192.168.100.118:54924[1](queued=0,recved=295,sent=0)
> >  /192.168.100.118:41390[1](queued=0,recved=2290,sent=0)
> >  /192.168.100.136:42243[1](queued=0,recved=0,sent=0)
> > Latency min/avg/max: 0/17/6333
> > Received: 47111
> > Sent: 0
> > Outstanding: 0
> > Zxid: 0x77000083f4
> > Mode: leader
> > Node count: 23
>
> That 6 second maximum latency is pretty bad but should be well within
> the zk session timeout.
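(The block above looks like the output of ZooKeeper's "stat" four-letter
command, which is the usual way to poll a quorum member for its client
list and latency figures; a sketch, with the host name as a placeholder:

    echo stat | nc zk-host 2181

Running that periodically against each quorum server is a cheap way to
watch whether a given regionserver's session is still attached and how
the max latency moves.)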
> So, problem is likely on the zk client side of the session, i.e. in the
> RS.  You could enable GC logging as J-D suggested to see if you have
> any big pauses, pauses > zk session timeout.
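(Once GC logging is on, the quickest check is to scan the resulting log
for stop-the-world collections whose pause time approaches the session
timeout.  A rough sketch, assuming whatever log path was configured in
hbase-env.sh and the usual ", N.NNNNNNN secs]" suffix that
-XX:+PrintGCDetails emits:

    grep "Full GC" /path/to/logs/gc-hbase.log | tail -20

Any pause in the tens of seconds is enough to let a ZooKeeper session
expire even with a generous timeout.)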
> When the RS went down, it didn't look too heavily loaded:
>
>   1. 2009-11-12 15:04:52,830 INFO
>      org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
>      request=1.5166667, regions=322, stores=657, storefiles=631,
>      storefileIndexSize=61, memstoreSize=1472, usedHeap=2819,
>      maxHeap=4079, blockCacheSize=658110960, blockCacheFree=197395984,
>      blockCacheCount=9903, blockCacheHitRatio=99
>
> It's serving a few reads?  The number of store files seems fine.  Not
> too much memory used.
> Looking at the logs, I see the Lease Still Held exception.  This
> happens when the RS does its regular report to the master but the
> master thinks the RS has since restarted.  It'll think this probably
> because it noticed that the RS's znode in zk had gone away and it
> considered the RS dead.
>
> Looking too at your logs, I see this gap in the zk pinging:
>   1. 2009-11-12 15:03:39,325 DEBUG org.apache.zookeeper.ClientCnxn: Got
>      ping response for sessionid:0x224e55436ad0004 after 0ms
>   2. 2009-11-12 15:03:43,113 DEBUG org.apache.zookeeper.ClientCnxn: Got
>      ping response for sessionid:0x24e55436a0007d after 0ms
>
> Where in the lines above it, it's reporting about every ten seconds;
> here there is a big gap.
> Do you have ganglia or something that will let you look more into what
> was happening on this machine around this time?  Is the machine OK?  It
> looks lightly loaded and you have your cluster nicely laid out.
> Something odd is going on.  What about things like the write speed to
> disk?  In the past strange issues have been explained by incorrectly
> set BIOS which made disks run at 1/100th of their proper speed.
> St.Ack
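(On the write-speed question: two quick, standard checks are a sequential
read timing with hdparm and a direct-I/O write with dd.  The device and
path below are placeholders for whichever disk backs the RS in question:

    hdparm -t /dev/sda
    dd if=/dev/zero of=/data/ddtest bs=1M count=1024 oflag=direct

A healthy 7200rpm SATA disk of this era should manage on the order of
50-80 MB/s sequentially; numbers two orders of magnitude lower would
point at the sort of misconfiguration described above.)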
> > Best,
> > zhenyu
>
> On Wed, Nov 11, 2009 at 3:58 PM, Zhenyu Zhong <zhongresearch@gmail.com> wrote:
>
> Stack,
> I very much appreciate your comments.
> I will use the zookeeper monitoring script on my cluster and let it run
> overnight to see the result.
> I will follow up on the thread when I get anything.
> thanks
> zhenyu
>
> On Wed, Nov 11, 2009 at 3:52 PM, stack <stack@duboce.net> wrote:
> I see these in your log too:
>
>   1. 2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn: Got
>      ping response for sessionid:0x424dfd908c50009 after 4544ms
>   2. 2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn: Got
>      ping response for sessionid:0x24dfd90c810002 after 4548ms
>   3. 2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.ClientCnxn: Got
>      ping response for sessionid:0x424dfd908c50009 after 9030ms
>   4. 2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.
