hbase-user mailing list archives

From Zhenyu Zhong <zhongresea...@gmail.com>
Subject Re: regionserver disconnection
Date Mon, 30 Nov 2009 23:04:41 GMT
Tons of thanks to Stack and J-D as well as the Zookeeper fellows.

Previously we experienced RS session timeouts even though we had set
zookeeper.session.timeout to 10 minutes.
It turned out that ZooKeeper enforces both a lower bound and an upper bound
on the session timeout.
The bounds are tied to tickTime: the session timeout can be no less than
2 times tickTime and no greater than 20 times tickTime. By default tickTime
is 2 seconds, so the maximum timeout is 40 seconds no matter how large a
value is set for zookeeper.session.timeout.
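
To make the arithmetic concrete, here is a minimal sketch of the clamping
described above. It is not ZooKeeper's actual code; the function name and
parameters are made up for illustration, and the 2x/20x multipliers are taken
from the description in this message.

```python
def effective_session_timeout_ms(requested_ms, tick_time_ms=2000):
    """Clamp a requested session timeout to [2 * tickTime, 20 * tickTime].

    Hypothetical helper: it only mirrors the bounds described in this thread.
    """
    lower = 2 * tick_time_ms
    upper = 20 * tick_time_ms
    return max(lower, min(requested_ms, upper))


# A 10 minute request with the default 2 second tickTime is capped at 40 seconds.
print(effective_session_timeout_ms(10 * 60 * 1000))  # 40000
```

Under these assumptions, any request above 20 ticks is silently reduced, which
would explain why the 10 minute setting appeared to have no effect.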

For reasons we have not pinned down (probably heavy system load, GC pauses,
etc.), 40 seconds is not long enough for us, so we had to raise that limit.

One way to do that is to increase hbase.zookeeper.property.tickTime. With
that value set to 4000ms, we haven't seen the RS session timeout so far.
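
As a rough sketch of how one might pick a tickTime for a target timeout,
assuming the 20x upper bound described earlier in this message (the helper
name is hypothetical, not part of HBase or ZooKeeper):

```python
def min_tick_time_ms(desired_timeout_ms, max_ticks=20):
    """Smallest tickTime (ms) whose upper bound covers the desired timeout.

    Assumes the maximum session timeout is max_ticks * tickTime.
    Integer ceiling division avoids floating-point rounding.
    """
    return -(-desired_timeout_ms // max_ticks)


# With tickTime = 4000ms the ceiling becomes 20 * 4000 = 80000ms (80 seconds).
print(20 * 4000)                 # 80000
# Covering a full 10 minute timeout would need a much larger tickTime.
print(min_tick_time_ms(600000))  # 30000
```

Note that a larger tickTime also raises the lower bound (2 ticks), and other
ZooKeeper settings expressed in ticks, such as initLimit and syncLimit, scale
with it, so very large values may have side effects.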

Many thanks to Stack, J-D and Zookeeper fellows.

I will keep you posted if the session timeout comes back or if anything
else causes RS disconnections.

Best,
zhenyu




On Thu, Nov 19, 2009 at 1:37 PM, stack <stack@duboce.net> wrote:

> From the zk fellas, try running your zk cluster at INFO level rather than
> at DEBUG.  One supposition is that the sheer amount of logging is causing
> distress.   They are still looking into your logs...
>
> Thanks for your patience Zhenyu.
> St.Ack
>
>
> On Thu, Nov 19, 2009 at 8:57 AM, Zhenyu Zhong <zhongresearch@gmail.com
> >wrote:
>
> > Yes, I believe so.  And I don't know why the leader gave up.
> > For this case, the regionserver got TIMEOUT warning message at 2009-11-18
> > 12:42:01,482
> >
> >
> > Best,
> > zhenyu
> >
> >
> >
> >
> > On Thu, Nov 19, 2009 at 11:43 AM, stack <stack@duboce.net> wrote:
> >
> > > So, the time at which we see
> > >
> > > 2009-11-18 12:41:56,782 ERROR
> > > org.apache.zookeeper.server.quorum.FollowerHandler: Unexpected exception
> > > causing shutdown while sock still open
> > >
> > > ... corresponds to the time at which the regionserver loses its session?
> > >
> > > St.Ack
> > >
> > > On Thu, Nov 19, 2009 at 8:38 AM, Zhenyu Zhong <zhongresearch@gmail.com
> > > >wrote:
> > >
> > > > Here is the zookeeper leader log. It had an unexpected exception.  I
> > > > don't know why.
> > > >
> > > > http://pastebin.com/m6594ccc
> > > >
> > > > During that time, I don't think the system load is high. Memory usage
> > > > should be normal.
> > > >
> > > > Best,
> > > > zhenyu
> > > >
> > > >
> > > >
> > > > On Thu, Nov 19, 2009 at 11:09 AM, stack <stack@duboce.net> wrote:
> > > >
> > > > > What is in the other zk logs?  Why did the leader go away?  Memory
> > > > > issue or something?  Why'd it shutdown?  Thanks for pastebin'ing this
> > > > > stuff.
> > > > > St.Ack
> > > > >
> > > > > On Thu, Nov 19, 2009 at 7:53 AM, Zhenyu Zhong <
> > zhongresearch@gmail.com
> > > > > >wrote:
> > > > >
> > > > > > After digging more on the zookeeper log, I found that there seemed
> > > > > > to be a socketconnection timeout. Also some warnings indicate that
> > > > > > zookeeper server is not running. Probably I need to understand the
> > > > > > mechanism of zookeeper first.
> > > > > >
> > > > > > Please find the zookeeper log around that time.
> > > > > >
> > > > > > http://pastebin.com/m37bb4ad1
> > > > > >
> > > > > > Many thanks
> > > > > > zhenyu
> > > > > >
> > > > > >
> > > > > > On Thu, Nov 19, 2009 at 10:31 AM, Zhenyu Zhong <
> > > > zhongresearch@gmail.com
> > > > > > >wrote:
> > > > > >
> > > > > > > Stack,
> > > > > > >
> > > > > > > > I really appreciate it. Let me dig into the zookeeper log more.
> > > > > > > >
> > > > > > > > FYI, I saw the sleeper complain about the delay for the first time.
> > > > > > > > It looks like there are different reasons for the RS disconnections.
> > > > > > >
> > > > > > > > 2009-11-19 08:08:32,708 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x42508716bfc0001 after 0ms
> > > > > > > > 2009-11-19 08:08:37,595 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x42508716bfc0082 after 0ms
> > > > > > > > 2009-11-19 08:08:46,073 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x42508716bfc0001 after 0ms
> > > > > > > > 2009-11-19 08:09:39,880 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x42508716bfc0001 to sun.nio.ch.SelectionKeyImpl@4f3a608f
> > > > > > > > java.io.IOException: TIMED OUT
> > > > > > > >         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> > > > > > > > 2009-11-19 08:09:39,880 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x42508716bfc0082 to sun.nio.ch.SelectionKeyImpl@14c8875
> > > > > > > > java.io.IOException: TIMED OUT
> > > > > > > >         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> > > > > > > > 2009-11-19 08:09:39,880 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 53556ms, ten times longer than scheduled: 3000
> > > > > > > > 2009-11-19 08:09:39,980 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
> > > > > > > > org.apache.hadoop.hbase.Leases$LeaseStillHeldException
> > > > > > > >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > > > > > > >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > > > > > > >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> > > > > > > >         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > > > > > > >         at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
> > > > > > > >         at org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
> > > > > > > >         at org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
> > > > > > > >         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
> > > > > > > >         at java.lang.Thread.run(Thread.java:619)
> > > > > > > > 2009-11-19 08:09:39,980 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null
> > > > > > > > 2009-11-19 08:09:40,530 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server superpyxis0005/192.168.100.119:2181
> > > > > > > > 2009-11-19 08:09:40,530 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/192.168.100.122:40659 remote=superpyxis0005/192.168.100.119:2181]
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > > zhenyu
> > > > > > >
> > > > > > > On Wed, Nov 18, 2009 at 6:52 PM, stack <stack@duboce.net>
> wrote:
> > > > > > >
> > > > > > >> Patrick Hunt, one of the zk lads, answered with the following:
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > > >> According to the following snippet (I cut to just session
> > > > > > > >> 0x425035c48210002) the client is getting pings for this session id
> > > > > > > >> regularly every 6 seconds or so, then after 41,016 there's a 20 second
> > > > > > > >> gap (corresponds to a 30 second timeout being set during session
> > > > > > > >> creation) at which point we disconnect from the server.
> > > > > > >>
> > > > > > >>
> > > > > > > >> > 2009-11-18 12:41:34,788 DEBUG org.apache.zookeeper.ClientCnxn: Got ping
> > > > > > > >> > response for sessionid:0x425035c48210002 after 0ms
> > > > > > > >> > 2009-11-18 12:42:01,482 WARN org.apache.zookeeper.ClientCnxn: Exception
> > > > > > > >> > closing session 0x425035c48210002 to sun.nio.ch.SelectionKeyImpl@421690ab
> > > > > > > >> > java.io.IOException: TIMED OUT
> > > > > > > >>
> > > > > > > >> the client notifies hbase of the disco, and attempts to reconnect to a
> > > > > > > >> server
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> > > > > > > >> > state: Disconnected, type: None, path: null
> > > > > > > >> > 2009-11-18 12:42:01,782 INFO org.apache.zookeeper.ClientCnxn: Attempting
> > > > > > > >> > connection to server superpyxis0003/192.168.100.117:2181
> > > > > > > >>
> > > > > > > >> but the server won't allow the session to be established for some reason
> > > > > > > >> (is that the right server:port that it's trying to connect to?)
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> > 2009-11-18 12:42:02,182 INFO org.apache.zookeeper.ClientCnxn: Server
> > > > > > > >> > connection successful
> > > > > > > >> > 2009-11-18 12:42:02,182 WARN org.apache.zookeeper.ClientCnxn: Exception
> > > > > > > >> > closing session 0x425035c48210002 to sun.nio.ch.SelectionKeyImpl@5c07076b
> > > > > > > >> > java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0
> > > > > > > >> > lim=4 cap=4]
> > > > > > >>
> > > > > > > >> The "read error rc = -1" means that the client read from the socket and
> > > > > > > >> got EOS. But it's not a session expiration afaict (and should not be at
> > > > > > > >> this point in time).
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> You need to look at the ZK server logs for this period in time (for this
> > > > > > > >> server superpyxis0003:2181, and superpyxis0001:2181, the client tries to
> > > > > > > >> connect to both of these). There should be some indication there why the
> > > > > > > >> server is closing down the connection and not allowing the session to be
> > > > > > > >> re-established.
> > > > > > > >>
> > > > > > > >> One thought - by default we limit clients to a max of 10 connections from
> > > > > > > >> a particular host to the server. If you exceed this limit the zk server
> > > > > > > >> (it's per server though, not per cluster, so perhaps this is not it) will
> > > > > > > >> not allow any additional connections from the client host (you can't
> > > > > > > >> exceed 10 on a server). Could this be it? i.e. the user is running a
> > > > > > > >> number of ZK clients (say hbase processes) from the same host? Regardless,
> > > > > > > >> get the server log for this time period and it should be more clear.
> > > > > > >>
> > > > > > >> Patrick
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> On Wed, Nov 18, 2009 at 11:18 AM, stack <stack@duboce.net>
> > wrote:
> > > > > > >>
> > > > > > > >> > And running stat on each of your zookeeper nodes tells you what about
> > > > > > > >> > averages for connect-times?
> > > > > > > >> >
> > > > > > > >> > It's odd that there is a ping just 20 seconds beforehand and you have set
> > > > > > > >> > the zk session timeout at 60 seconds (or ten minutes?).  Let me ask the zk
> > > > > > > >> > lads about it.
> > > > > > > >> >
> > > > > > > >> > If you look in the GC log for around this time, do you see anything?
> > > > > > >> >
> > > > > > >> > St.Ack
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > On Wed, Nov 18, 2009 at 7:32 AM, Zhenyu Zhong <
> > > > > > zhongresearch@gmail.com
> > > > > > >> >wrote:
> > > > > > >> >
> > > > > > > >> >> After running more experiments, it seems that the disconnection is not
> > > > > > > >> >> related to the heavy load job, because the load average is low, the disk
> > > > > > > >> >> I/O is normal, the memory heap limit was not reached, and the virtual
> > > > > > > >> >> memory stats show no swapping.  However, the disconnection still happens.
> > > > > > > >> >>
> > > > > > > >> >> It looks like this time it pauses for 20 seconds. No idea why the
> > > > > > > >> >> regionserver disconnected.
> > > > > > > >> >>
> > > > > > > >> >> Any other suggestions please?
> > > > > > >> >>
> > > > > > >> >>
> > > > > > > >> >> 2009-11-18 12:41:00,956 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x125035c495d02bb after 0ms
> > > > > > > >> >> 2009-11-18 12:41:08,074 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x425035c48210002 after 0ms
> > > > > > > >> >> 2009-11-18 12:41:14,302 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x125035c495d02bb after 0ms
> > > > > > > >> >> 2009-11-18 12:41:21,433 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x425035c48210002 after 0ms
> > > > > > > >> >> 2009-11-18 12:41:27,656 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x125035c495d02bb after 0ms
> > > > > > > >> >> 2009-11-18 12:41:34,788 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x425035c48210002 after 0ms
> > > > > > > >> >> 2009-11-18 12:41:41,016 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x125035c495d02bb after 1ms
> > > > > > > >> >> 2009-11-18 12:42:01,482 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x425035c48210002 to sun.nio.ch.SelectionKeyImpl@421690ab
> > > > > > > >> >> java.io.IOException: TIMED OUT
> > > > > > > >> >>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> > > > > > > >> >> 2009-11-18 12:42:01,582 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null
> > > > > > > >> >> 2009-11-18 12:42:01,782 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server superpyxis0003/192.168.100.117:2181
> > > > > > > >> >> 2009-11-18 12:42:01,782 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/192.168.100.132:46610 remote=superpyxis0003/192.168.100.117:2181]
> > > > > > > >> >> 2009-11-18 12:42:01,782 INFO org.apache.zookeeper.ClientCnxn: Server connection successful
> > > > > > > >> >> 2009-11-18 12:42:01,782 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x425035c48210002 to sun.nio.ch.SelectionKeyImpl@56dc6fac
> > > > > > > >> >> java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
> > > > > > > >> >>        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
> > > > > > > >> >>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> > > > > > > >> >> 2009-11-18 12:42:01,782 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown input
> > > > > > > >> >> java.net.SocketException: Transport endpoint is not connected
> > > > > > > >> >>        at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
> > > > > > > >> >>        at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
> > > > > > > >> >>        at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
> > > > > > > >> >>        at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
> > > > > > > >> >>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
> > > > > > > >> >> 2009-11-18 12:42:01,782 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown output
> > > > > > > >> >> java.net.SocketException: Transport endpoint is not connected
> > > > > > > >> >>        at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
> > > > > > > >> >>        at sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketChannelImpl.java:651)
> > > > > > > >> >>        at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor.java:368)
> > > > > > > >> >>        at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1004)
> > > > > > > >> >>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
> > > > > > > >> >> 2009-11-18 12:42:02,182 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server superpyxis0001/192.168.100.115:2181
> > > > > > > >> >> 2009-11-18 12:42:02,182 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/192.168.100.132:36197 remote=superpyxis0001/192.168.100.115:2181]
> > > > > > > >> >> 2009-11-18 12:42:02,182 INFO org.apache.zookeeper.ClientCnxn: Server connection successful
> > > > > > > >> >> 2009-11-18 12:42:02,182 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x425035c48210002 to sun.nio.ch.SelectionKeyImpl@5c07076b
> > > > > > > >> >> java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
> > > > > > > >> >>        at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
> > > > > > > >> >>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> > > > > > > >> >> 2009-11-18 12:42:02,182 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown input
> > > > > > > >> >> java.net.SocketException: Transport endpoint is not connected
> > > > > > > >> >>        at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
> > > > > > > >> >>        at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
> > > > > > > >> >>        at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
> > > > > > > >> >>        at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
> > > > > > > >> >>        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
> > > > > > > >> >> 2009-11-18 12:42:02,182 WARN org.apache.zookeeper.ClientCnxn: Ignoring exception during shutdown output
> > > > > > > >> >> java.net.SocketException: Transport endpoint is not connected
> > > > > > >> >>
> > > > > > >> >>
> > > > > > >> >> On Mon, Nov 16, 2009 at 4:16 PM, Zhenyu Zhong <
> > > > > > zhongresearch@gmail.com
> > > > > > >> >> >wrote:
> > > > > > >> >>
> > > > > > > >> >> > Here is the diskIO and CPU around the time we had RS disconnection on
> > > > > > > >> >> > one machine that runs RegionServer. It doesn't seem to be high. Similar
> > > > > > > >> >> > disk and cpu usage have been seen before and the HBase was running fine.
> > > > > > > >> >> >
> > > > > > > >> >> >
> > > > > > > >> >> > So far I haven't found why my 10 minutes session timeout doesn't apply.
> > > > > > > >> >> > Still digging.
> > > > > > >> >> >
> > > > > > >> >> > Many thanks!
> > > > > > >> >> >
> > > > > > > >> >> > Time: 03:01:08 PM
> > > > > > > >> >> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> > > > > > > >> >> >            3.82    0.00    0.89    3.97    0.00   91.32
> > > > > > > >> >> >
> > > > > > > >> >> >
> > > > > > > >> >> > Device:            tps   Blk_read/s   Blk_wrtn/s     Blk_read     Blk_wrtn
> > > > > > > >> >> > sda              24.30      1605.39      6679.35   6238623388  25956251720
> > > > > > > >> >> > sda1              0.51         1.94        15.71      7547265     61067904
> > > > > > > >> >> > sda2              0.13         2.11         5.20      8202079     20189232
> > > > > > > >> >> > sda3              0.13         0.25         1.63       987846      6323336
> > > > > > > >> >> > sda4              0.00         0.00         0.00            4            0
> > > > > > > >> >> > sda5              1.78         0.06        24.14       220010     93817208
> > > > > > > >> >> > sda6              0.38         0.42         8.41      1630584     32688152
> > > > > > > >> >> > sda7             21.38      1600.61      6624.26   6220035272  25742165888
> > > > > > > >> >> > sdb               4.52       767.33       565.91   2981868690   2199132380
> > > > > > > >> >> > sdb1              4.52       767.33       565.91   2981866755   2199132372
> > > > > > > >> >> > sdc               4.42       742.95       563.00   2887151482   2187823092
> > > > > > > >> >> > sdc1              4.42       742.95       563.00   2887149547   2187823084
> > > > > > > >> >> > sdd               4.49       750.78       557.25   2917554074   2165513500
> > > > > > > >> >> > sdd1              4.49       750.78       557.25   2917552139   2165513492
> > > > > > > >> >> > sde               4.52       758.51       569.46   2947593394   2212964236
> > > > > > > >> >> > sde1              4.52       758.51       569.46   2947591459   2212964228
> > > > > > > >> >> > sdf               4.51       747.22       571.78   2903740266   2221972652
> > > > > > > >> >> > sdf1              4.51       747.22       571.78   2903738331   2221972644
> > > > > > >> >> >
> > > > > > >> >> > Best,
> > > > > > >> >> > zhenyu
> > > > > > >> >> >
> > > > > > >> >> >
> > > > > > >> >> >
> > > > > > >> >> > On Mon, Nov 16, 2009 at 4:07 PM, stack <stack@duboce.net
> >
> > > > wrote:
> > > > > > >> >> >
> > > > > > >> >> >> On Mon, Nov 16, 2009 at 12:05 PM, Zhenyu Zhong <
> > > > > > >> >> zhongresearch@gmail.com
> > > > > > >> >> >> >wrote:
> > > > > > >> >> >>
> > > > > > > >> >> >> > I just realized that there was a MapReduce job running during the time
> > > > > > > >> >> >> > the regionserver disconnected from the zookeeper.
> > > > > > > >> >> >> > That MapReduce Job was processing 500GB data and took about 8 minutes
> > > > > > > >> >> >> > to finish. It launched over 2000 map tasks.
> > > > > > >> >> >>
> > > > > > >> >> >>
> > > > > > > >> >> >> There was a tasktracker running on the RegionServer that disconnected?
> > > > > > > >> >> >> Is the map i/o or cpu heavy?  Do you think it could have stolen life
> > > > > > > >> >> >> from the datanode/regionserver?
> > > > > > >> >> >>
> > > > > > >> >> >>
> > > > > > >> >> >>
> > > > > > > >> >> >> > I doubt that this introduced
> > > > > > > >> >> >> > resource contention between DataNode and RegionServer Node.
> > > > > > > >> >> >> > Also during the time that MapReduce job ran, I saw a few errors
> > > > > > > >> >> >> > indicating that
> > > > > > > >> >> >> >
> > > > > > > >> >> >> > java.io.IOException: Could not obtain block: blk_7661247556283580226_4450462 file=/data/xxx.txt.200910141159
> > > > > > > >> >> >> >        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1787)
> > > > > > > >> >> >> >        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1615)
> > > > > > > >> >> >> >        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1742)
> > > > > > > >> >> >> >        at java.io.DataInputStream.read(DataInputStream.java:83)
> > > > > > > >> >> >> >        at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
> > > > > > > >> >> >> >        at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
> > > > > > > >> >> >> >        at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
> > > > > > > >> >> >> >        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> > > > > > > >> >> >> >        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> > > > > > > >> >> >> >        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> > > > > > > >> >> >> >        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> > > > > > > >> >> >> >        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> > > > > > > >> >> >> >        at org.apache.hadoop.mapred.Child.main(Child.java:170)
> > > > > > >> >> >> >
> > > > > > >> >> >>
> > > > > > > >> >> >> This is interesting.  Your map task is running into hdfs-127.  Yeah,
> > > > > > > >> >> >> you should patch your cluster if you want to get rid of these
> > > > > > > >> >> >> (hdfs-127 has been applied to 0.21 hadoop and will be in the next
> > > > > > > >> >> >> hadoop release on 0.20.x branch, hadoop 0.20.2).
> > > > > > >> >> >>
> > > > > > >> >> >>
> > > > > > >> >> >>
> > > > > > >> >> >> >
> > > > > > > >> >> >> > Possibly it is related to HDFS-127, but I don't see that any datanodes
> > > > > > > >> >> >> > went down. Does that affect Regionserver? Shall we apply the patch?
> > > > > > >> >> >> >
> > > > > > >> >> >> >
> > > > > > > >> >> >> Read up on hdfs-127 for an explanation of what's going on.
> > > > > > >> >> >>
> > > > > > >> >> >>
> > > > > > >> >> >>
> > > > > > > >> >> >> > Now I have started tracking the virtual memory stats to see if the
> > > > > > > >> >> >> > same issue happens tomorrow around the same time period.
> > > > > > >> >> >> >
> > > > > > >> >> >>
> > > > > > >> >> >> Excellent.  Thanks for doing this.
> > > > > > >> >> >>
> > > > > > >> >> >>
> > > > > > >> >> >>
> > > > > > > >> >> >> > And I highly suspect that this particular MapReduce job hurts HBase
> > > > > > > >> >> >> > Regionserver.
> > > > > > > >> >> >> >
> > > > > > > >> >> >> > PS. I also use gcviewer to parse the GC-log, I only see around 30
> > > > > > > >> >> >> > seconds pauses maximum.
> > > > > > >> >> >> >
> > > > > > >> >> >>
> > > > > > >> >> >>
> > > > > > > >> >> >> Did you figure why your ten minute session is not applying?
> > > > > > >> >> >>
> > > > > > >> >> >> St.Ack
> > > > > > >> >> >>
> > > > > > >> >> >>
> > > > > > >> >> >>
> > > > > > >> >> >> >
> > > > > > >> >> >> > Thanks a lot.
> > > > > > >> >> >> > zhenyu
> > > > > > >> >> >> >
> > > > > > >> >> >> >
> > > > > > >> >> >> >
> > > > > > >> >> >> > On Sun, Nov 15, 2009 at 12:52 AM, Zhenyu Zhong <
> > > > > > >> >> zhongresearch@gmail.com
> > > > > > >> >> >> > >wrote:
> > > > > > >> >> >> >
> > > > > > >> >> >> > > J-D,
> > > > > > >> >> >> > >
> > > > > > >> >> >> > > Thank you very much for your comments.
> > > > > > > >> >> >> > > My company blocks the IRC port, so I have trouble connecting to the
> > > > > > > >> >> >> > > IRC channel.
> > > > > > > >> >> >> > > I have been trying to ask IT to open the IRC port for me; it might
> > > > > > > >> >> >> > > take a while.
> > > > > > >> >> >> > >
> > > > > > >> >> >> > > Best,
> > > > > > >> >> >> > > zhenyu
> > > > > > >> >> >> > >
> > > > > > >> >> >> > >
> > > > > > >> >> >> > > On Sat, Nov 14, 2009 at 2:21 PM, Jean-Daniel Cryans
> <
> > > > > > >> >> >> jdcryans@apache.org
> > > > > > >> >> >> > >wrote:
> > > > > > >> >> >> > >
> > > > > > > >> >> >> > >> The error you are getting is a disconnection from a zookeeper server
> > > > > > > >> >> >> > >> and is very generic.
> > > > > > > >> >> >> > >>
> > > > > > > >> >> >> > >> ZK-86 is still open and the last comment refers to ZK-111 saying
> > > > > > > >> >> >> > >> that the bug (in unit tests) was probably fixed in release 3.0.0 last
> > > > > > > >> >> >> > >> year.
> > > > > > > >> >> >> > >>
> > > > > > > >> >> >> > >> To figure the hang you have, you can try to jstack the process pid and
> > > > > > > >> >> >> > >> see exactly what's holding the RS from shutting down.
> > > > > > > >> >> >> > >>
> > > > > > > >> >> >> > >> Would it be possible for you to drop by the IRC channel? This way we
> > > > > > > >> >> >> > >> can debug this at a much faster pace.
> > > > > > >> >> >> > >>
> > > > > > >> >> >> > >> Thx!
> > > > > > >> >> >> > >>
> > > > > > >> >> >> > >> J-D
> > > > > > >> >> >> > >>
> > > > > > >> >> >> > >> On Sat, Nov 14, 2009 at 9:57 AM, Zhenyu Zhong <
> > > > > > >> >> >> zhongresearch@gmail.com>
> > > > > > >> >> >> > >> wrote:
> > > > > > > >> >> >> > >> > I found this.
> > > > > > > >> >> >> > >> > http://issues.apache.org/jira/browse/ZOOKEEPER-86
> > > > > > > >> >> >> > >> >
> > > > > > > >> >> >> > >> > It looks like the same error I had. Is it a zookeeper bug? When will
> > > > > > > >> >> >> > >> > HBase take the zookeeper version 3.3.0?
> > > > > > >> >> >> > >> >
> > > > > > >> >> >> > >> > thanks
> > > > > > >> >> >> > >> > zhenyu
> > > > > > >> >> >> > >> >
> > > > > > >> >> >> > >> > On Sat, Nov 14, 2009 at 11:31 AM, Zhenyu Zhong <
> > > > > > >> >> >> > zhongresearch@gmail.com
> > > > > > >> >> >> > >> >wrote:
> > > > > > >> >> >> > >> >
> > > > > > >> >> >> > >> >> Now I really doubt about the zookeeper, from the
> > log
> > > I
> > > > > saw
> > > > > > >> >> errors
> > > > > > >> >> >> > like
> > > > > > >> >> >> > >> >> IOException Read Error, while the zookeeper
> client
> > > > > > >> >> (regionserver )
> > > > > > >> >> >> > >> tried to
> > > > > > > >> >> >> > >> >> read. But it got a disconnected status from
> > zookeeper.
> > > > > > >> >> >> > >> >>
> > > > > > >> >> >> > >> >> I don't see any load on any zookeeper quorum
> > > servers.
> > > > > > DiskIO
> > > > > > >> is
> > > > > > >> >> >> fine.
> > > > > > >> >> >> > >> >>
> > > > > > >> >> >> > >> >> Also when the regionserver decides to exit due
> to
> > > the
> > > > > > >> >> disconnect
> > > > > > >> >> >> > status
> > > > > > >> >> >> > >> >> from zookeeper, sometimes the regionserver hangs
> > > > during
> > > > > > the
> > > > > > >> >> >> exiting.
> > > > > > >> >> >> > We
> > > > > > >> >> >> > >> can
> > > > > > > >> >> >> > >> >> still see the HRegionServer process even though we
> don't
> > > see
> > > > it
> > > > > > in
> > > > > > >> the
> > > > > > >> >> >> > Master
> > > > > > >> >> >> > >> web
> > > > > > >> >> >> > >> >> interface.
> > > > > > >> >> >> > >> >>
> > > > > > > >> >> >> > >> >> I also notice that there are zookeeper.retries
> > and
> > > > > > >> >> >> zookeeper.pauses
> > > > > > >> >> >> > >> >> settings in hbase-default.xml, would that matter
> ?
> > > > > > >> >> >> > >> >>
> > > > > > >> >> >> > >> >> thanks
> > > > > > >> >> >> > >> >>
> > > > > > >> >> >> > >> >> You may see the IOException error in the
> following
> > > log
> > > > > > when
> > > > > > >> the
> > > > > > >> >> >> > >> >> Regionserver lost connection to the zookeeper.
> > > > > > >> >> >> > >> >>
> > > > > > > >> >> >> > >> >> 2009-11-14 15:58:31,993 INFO org.apache.hadoop.hbase.regionserver.HLog: Roll /hbase/.logs/superpyxis0008.scur.colo,60021,1258179101617/hlog.dat.1258179106921, entries=1062964, calcsize=255014936, filesize=139566093. New hlog /hbase/.logs/superpyxis0008.scur.colo,60021,1258179101617/hlog.dat.1258214311993
> > > > > > > >> >> >> > >> >> 2009-11-14 15:58:39,295 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424f0ed01ea00a6 after 0ms
> > > > > > > >> >> >> > >> >> 2009-11-14 15:58:52,648 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424f0ed01ea00a6 after 0ms
> > > > > > > >> >> >> > >> >> 2009-11-14 15:59:06,007 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424f0ed01ea00a6 after 0ms
> > > > > > > >> >> >> > >> >> 2009-11-14 15:59:19,365 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424f0ed01ea00a6 after 0ms
> > > > > > > >> >> >> > >> >> 2009-11-14 15:59:32,867 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424f0ed01ea00a6 after 186ms
> > > > > > > >> >> >> > >> >> 2009-11-14 15:59:46,070 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424f0ed01ea00a6 after 0ms
> > > > > > > >> >> >> > >> >> 2009-11-14 15:59:59,423 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424f0ed01ea00a6 after 0ms
> > > > > > > >> >> >> > >> >> 2009-11-14 16:00:12,775 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424f0ed01ea00a6 after 0ms
> > > > > > > >> >> >> > >> >> 2009-11-14 16:00:28,141 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424f0ed01ea00a6 after 2010ms
> > > > > > > >> >> >> > >> >> 2009-11-14 16:01:10,378 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x424f0ed01ea00a6 to sun.nio.ch.SelectionKeyImpl@39fbb2d6
> > > > > > > >> >> >> > >> >> java.io.IOException: Read error rc = -1 java.nio.DirectByteBuffer[pos=0 lim=4 cap=4]
> > > > > > > >> >> >> > >> >>         at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
> > > > > > > >> >> >> > >> >>         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> > > > > > > >> >> >> > >> >> 2009-11-14 16:01:10,478 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null
> > > > > > > >> >> >> > >> >> 2009-11-14 16:01:11,333 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
> > > > > > > >> >> >> > >> >> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
> > > > > > > >> >> >> > >> >>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > > > > > > >> >> >> > >> >>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > > > > > > >> >> >> > >> >>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> > > > > > > >> >> >> > >> >>         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > > > > > > >> >> >> > >> >>         at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
> > > > > > > >> >> >> > >> >>         at org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
> > > > > > > >> >> >> > >> >>         at org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
> > > > > > > >> >> >> > >> >>         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:571)
> > > > > > > >> >> >> > >> >>         at java.lang.Thread.run(Thread.java:619)
> > > > > > > >> >> >> > >> >> 2009-11-14 16:01:11,433 INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server superpyxis0001/192.168.100.115:2181
> > > > > > > >> >> >> > >> >> 2009-11-14 16:01:11,433 INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/192.168.100.122:59575 remote=superpyxis0001/192.168.100.115:2181]
> > > > > > > >> >> >> > >> >> 2009-11-14 16:01:11,433 INFO org.apache.zookeeper.ClientCnxn: Server connection successful
> > > > > > > >> >> >> > >> >> 2009-11-14 16:01:11,433 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Expired, type: None, path: null
> > > > > > > >> >> >> > >> >> 2009-11-14 16:01:11,433 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x424f0ed01ea00a6 to sun.nio.ch.SelectionKeyImpl@6b93d343
> > > > > > > >> >> >> > >> >> java.io.IOException: Session Expired
> > > > > > > >> >> >> > >> >>         at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
> > > > > > > >> >> >> > >> >>         at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
> > > > > > > >> >> >> > >> >>         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> > > > > > > >> >> >> > >> >> 2009-11-14 16:01:11,433 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
> > > > > > > >> >> >> > >> >> 2009-11-14 16:01:13,513 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60021, call put([B@73deb204, [Lorg.apache.hadoop.hbase.client.Put;@2179600a) from 192.168.100.132:40728: error: java.io.IOException: Server not running, aborting
> > > > > > > >> >> >> > >> >> java.io.IOException: Server not running, aborting
> > > > > > >> >> >> > >> >>
> > > > > > >> >> >> > >> >>
> > > > > > >> >> >> > >> >>
> > > > > > >> >> >> > >> >>
> > > > > > >> >> >> > >> >>
> > > > > > >> >> >> > >> >> On Sat, Nov 14, 2009 at 1:01 AM, Zhenyu Zhong <
> > > > > > >> >> >> > zhongresearch@gmail.com
> > > > > > >> >> >> > >> >wrote:
> > > > > > >> >> >> > >> >>
> > > > > > >> >> >> > >> >>> Stack,
> > > > > > >> >> >> > >> >>>
> > > > > > >> >> >> > >> >>> Thanks a lot!
> > > > > > >> >> >> > >> >>>
> > > > > > > >> >> >> > >> >>> I found out the reason why HBase doesn't
> > take
> > > > the
> > > > > > >> system
> > > > > > >> >> >> file
> > > > > > >> >> >> > >> >>> descriptor value I set before. I started the
> > HBase
> > > > > using
> > > > > > >> root
> > > > > > >> >> >> > instead
> > > > > > >> >> >> > >> of the
> > > > > > >> >> >> > >> >>> normal hadoop user, while my system configures
> > > hadoop
> > > > > > with
> > > > > > >> >> higher
> > > > > > >> >> >> > file
> > > > > > >> >> >> > >> >>> descriptor value but configures root with 1024
> > > > default
> > > > > > >> value.
> > > > > > >> >> >> > >> >>>
> > > > > > >> >> >> > >> >>>
> > > > > > >> >> >> > >> >>> Now my system has a clean start. However, it
> > seems
> > > > that
> > > > > > the
> > > > > > > >> >> >> > >> >>> zookeeper.session.timeout value doesn't take
> > > > > effect.
> > > > > > I
> > > > > > >> >> still
> > > > > > >> >> >> > >> found
> > > > > > >> >> >> > >> >>> around 60 seconds pauses from the disconnected
> > > > > > >> regionserver. I
> > > > > > >> >> >> > really
> > > > > > >> >> >> > >> don't
> > > > > > >> >> >> > >> >>> know why regionserver only times out after 60
> > > seconds
> > > > > > >> instead
> > > > > > >> >> of
> > > > > > >> >> >> 10
> > > > > > >> >> >> > >> minutes
> > > > > > >> >> >> > >> >>> which I set for zookeeper.session.timeout.
> > > > > > >> >> >> > >> >>>
> > > > > > >> >> >> > >> >>> Is there any other timeout value coming into
> play
> > > > > before
> > > > > > >> the
> > > > > > >> >> >> actual
> > > > > > >> >> >> > >> >>> session times out?
> > > > > > >> >> >> > >> >>>
> > > > > > >> >> >> > >> >>> zhenyu
> > > > > > >> >> >> > >> >>>
> > > > > > >> >> >> > >> >>>
> > > > > > >> >> >> > >> >>>
> > > > > > >> >> >> > >> >>> On Fri, Nov 13, 2009 at 7:08 PM, stack <
> > > > > stack@duboce.net
> > > > > > >
> > > > > > >> >> wrote:
> > > > > > >> >> >> > >> >>>
> > > > > > >> >> >> > >> >>>> Ok.  Lack of file descriptors manifests in all
> > > kinds
> > > > > of
> > > > > > >> weird
> > > > > > >> >> >> ways.
> > > > > > > >> >> >> > >> >>>> Hopefully that's it.  If not, let's keep
> digging.
> > > > > > >> >> >> > >> >>>> St.Ack
> > > > > > >> >> >> > >> >>>>
> > > > > > >> >> >> > >> >>>> On Fri, Nov 13, 2009 at 3:44 PM, Zhenyu Zhong
> <
> > > > > > >> >> >> > >> zhongresearch@gmail.com
> > > > > > >> >> >> > >> >>>> >wrote:
> > > > > > >> >> >> > >> >>>>
> > > > > > >> >> >> > >> >>>> > Stack,
> > > > > > >> >> >> > >> >>>> >
> > > > > > >> >> >> > >> >>>> > You are right, the master started with
> ulimit
> > -n
> > > > > 1024.
> > > > > > >> It
> > > > > > >> >> >> doesn't
> > > > > > >> >> >> > >> take
> > > > > > >> >> >> > >> >>>> the
> > > > > > >> >> >> > >> >>>> > system value.
> > > > > > >> >> >> > >> >>>> >
> > > > > > > >> >> >> > >> >>>> > Regarding the too many open files, it
> looks
> > > > like
> > > > > > the
> > > > > > >> >> same
> > > > > > >> >> >> as
> > > > > > >> >> >> > the
> > > > > > >> >> >> > >> one
> > > > > > >> >> >> > >> >>>> J-D
> > > > > > >> >> >> > >> >>>> > put up. But I will get the Master start with
> > > > higher
> > > > > > >> value
> > > > > > >> >> >> first
> > > > > > >> >> >> > and
> > > > > > >> >> >> > >> see
> > > > > > >> >> >> > >> >>>> if
> > > > > > >> >> >> > >> >>>> > this kind of error goes away.
> > > > > > >> >> >> > >> >>>> >
> > > > > > >> >> >> > >> >>>> > thanks a lot!
> > > > > > >> >> >> > >> >>>> > zhenyu
> > > > > > >> >> >> > >> >>>> >
> > > > > > >> >> >> > >> >>>> > On Fri, Nov 13, 2009 at 6:02 PM, stack <
> > > > > > >> stack@duboce.net>
> > > > > > >> >> >> wrote:
> > > > > > >> >> >> > >> >>>> >
> > > > > > >> >> >> > >> >>>> > > Does it say
> > > > > > >> >> >> > >> >>>> > >
> > > > > > >> >> >> > >> >>>> > > ulimit -n 32768
> > > > > > >> >> >> > >> >>>> > >
> > > > > > >> >> >> > >> >>>> > > ...as the second line in your log file on
> > > start
> > > > of
> > > > > > the
> > > > > > >> >> >> master?
> > > > > > >> >> >> > >> >>>> > >
> > > > > > >> >> >> > >> >>>> > > Is the stack trace that complains about
> too
> > > many
> > > > > > open
> > > > > > >> >> files
> > > > > > >> >> >> > same
> > > > > > >> >> >> > >> as
> > > > > > >> >> >> > >> >>>> the
> > > > > > >> >> >> > >> >>>> > one
> > > > > > >> >> >> > >> >>>> > > in the blog post J-D put up?
> > > > > > >> >> >> > >> >>>> > >
> > > > > > >> >> >> > >> >>>> > > St.Ack
> > > > > > >> >> >> > >> >>>> > >
> > > > > > >> >> >> > >> >>>> > >
> > > > > > >> >> >> > >> >>>> > > On Fri, Nov 13, 2009 at 1:37 PM, Zhenyu
> > Zhong
> > > <
> > > > > > >> >> >> > >> >>>> zhongresearch@gmail.com
> > > > > > >> >> >> > >> >>>> > > >wrote:
> > > > > > >> >> >> > >> >>>> > >
> > > > > > >> >> >> > >> >>>> > > > The ulimit file descriptors was set to
> > > > > fs.file-max
> > > > > > >> >> >> =1578334,
> > > > > > >> >> >> > >> also
> > > > > > >> >> >> > >> >>>> in
> > > > > > >> >> >> > >> >>>> > the
> > > > > > >> >> >> > >> >>>> > > > limits.conf the value was set to 32768.
> > > > > > >> >> >> > >> >>>> > > > So these are way higher than the open
> > > > > descriptors
> > > > > > >> for
> > > > > > >> >> the
> > > > > > >> >> >> > >> running
> > > > > > >> >> >> > >> >>>> > > > processes.
> > > > > > >> >> >> > >> >>>> > > >
> > > > > > >> >> >> > >> >>>> > > > thanks
> > > > > > >> >> >> > >> >>>> > > > zhenyu
> > > > > > >> >> >> > >> >>>> > > >
> > > > > > >> >> >> > >> >>>> > > >
> > > > > > >> >> >> > >> >>>> > > > On Fri, Nov 13, 2009 at 4:33 PM, Stack <
> > > > > > >> >> >> saint.ack@gmail.com>
> > > > > > >> >> >> > >> >>>> wrote:
> > > > > > >> >> >> > >> >>>> > > >
> > > > > > >> >> >> > >> >>>> > > > > You upped the ulimit file descriptors
> as
> > > per
> > > > > the
> > > > > > >> >> getting
> > > > > > >> >> >> > >> started
> > > > > > >> >> >> > >> >>>> doc?
> > > > > > >> >> >> > >> >>>> > > > >
> > > > > > >> >> >> > >> >>>> > > > >
> > > > > > >> >> >> > >> >>>> > > > >
> > > > > > >> >> >> > >> >>>> > > > > On Nov 13, 2009, at 1:26 PM, Zhenyu
> > Zhong
> > > <
> > > > > > >> >> >> > >> >>>> zhongresearch@gmail.com>
> > > > > > >> >> >> > >> >>>> > > > wrote:
> > > > > > >> >> >> > >> >>>> > > > >
> > > > > > >> >> >> > >> >>>> > > > >  Thanks a lot.
> > > > > > >> >> >> > >> >>>> > > > >>
> > > > > > >> >> >> > >> >>>> > > > >>
> > > > > > >> >> >> > >> >>>> > > > >> Bad news is my kernel is still
> 2.6.26.
> > > > > > >> >> >> > >> >>>> > > > >> But it was not a problem before.
> > > > > > >> >> >> > >> >>>> > > > >>
> > > > > > >> >> >> > >> >>>> > > > >> Very strange.
> > > > > > >> >> >> > >> >>>> > > > >>
> > > > > > >> >> >> > >> >>>> > > > >> zhenyu
> > > > > > >> >> >> > >> >>>> > > > >>
> > > > > > >> >> >> > >> >>>> > > > >> On Fri, Nov 13, 2009 at 4:16 PM,
> > > > Jean-Daniel
> > > > > > >> Cryans
> > > > > > >> >> <
> > > > > > >> >> >> > >> >>>> > > > jdcryans@apache.org
> > > > > > >> >> >> > >> >>>> > > > >> >wrote:
> > > > > > >> >> >> > >> >>>> > > > >>
> > > > > > > >> >> >> > >> >>>> > > > >>> Looks like
> > > > > > > >> >> >> > >> >>>> > > > >>> http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linux-kernel-2627-epoll-limits/
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>> J-D
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>> On Fri, Nov 13, 2009 at 1:12 PM,
> > Zhenyu
> > > > > Zhong
> > > > > > <
> > > > > > >> >> >> > >> >>>> > > zhongresearch@gmail.com
> > > > > > >> >> >> > >> >>>> > > > >
> > > > > > >> >> >> > >> >>>> > > > >>> wrote:
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>> Hi,
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>> After I re-organized the cluster,
> the
> > > > > > experiment
> > > > > > >> >> ran
> > > > > > >> >> >> into
> > > > > > >> >> >> > >> >>>> problem
> > > > > > >> >> >> > >> >>>> > > > faster
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>> than
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>> before.
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>> Basically the changes are to use
> > > machines
> > > > > > with
> > > > > > >> >> less
> > > > > > >> >> >> > >> resources
> > > > > > >> >> >> > >> >>>> as
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>> zookeeper
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>> quorums and machines with more
> > > resources
> > > > as
> > > > > > >> >> >> > regionserver.
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>> From the log, I still see the pause
> > > > around
> > > > > 1
> > > > > > >> >> minute.
> > > > > > >> >> >> > >> >>>> > > > >>>> I enabled the GC logging,  please
> see
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>> http://pastebin.com/m1d4ce0f1
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>> for details.
> > > > > > >> >> >> > >> >>>> > > > >>>> In general I don't see many pauses
> in
> > > the
> > > > > GC.
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>> What is more interesting, 1 hour
> > after
> > > > the
> > > > > > 1st
> > > > > > >> >> >> > >> regionserver
> > > > > > >> >> >> > >> >>>> > > > >>>> disconnected,
> > > > > > >> >> >> > >> >>>> > > > >>>> the master log complained about too
> > > many
> > > > > open
> > > > > > >> >> files.
> > > > > > >> >> >> > This
> > > > > > >> >> >> > >> >>>> didn't
> > > > > > >> >> >> > >> >>>> > > > happen
> > > > > > >> >> >> > >> >>>> > > > >>>> before.
> > > > > > >> >> >> > >> >>>> > > > >>>> I checked the system OS setup as
> well
> > > as
> > > > > the
> > > > > > >> >> >> > limits.conf.
> > > > > > >> >> >> > >> I
> > > > > > >> >> >> > >> >>>> also
> > > > > > >> >> >> > >> >>>> > > > checked
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>> the
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>> running processes. The total open
> > files
> > > > > don't
> > > > > > >> >> reach
> > > > > > >> >> >> the
> > > > > > >> >> >> > >> limit.
> > > > > > >> >> >> > >> >>>> I
> > > > > > >> >> >> > >> >>>> > am
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>> confused
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>> a bit.
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>> Please see the following master
> log.
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:02,114 INFO
> > > > > > >> >> >> > >> >>>> > > >
> > org.apache.hadoop.hbase.master.BaseScanner:
> > > > > > >> >> >> > >> >>>> > > > >>>> RegionManager.metaScanner scan of
> > 4658
> > > > > row(s)
> > > > > > >> of
> > > > > > >> >> meta
> > > > > > >> >> >> > >> region
> > > > > > >> >> >> > >> >>>> > > {server:
> > > > > > >> >> >> > >> >>>> > > > >>>> 192.168.100.128:60021, regionname:
> > > > > > .META.,,1,
> > > > > > >> >> >> startKey:
> > > > > > >> >> >> > >> <>}
> > > > > > >> >> >> > >> >>>> > > complete
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:02,114 INFO
> > > > > > >> >> >> > >> >>>> > > >
> > org.apache.hadoop.hbase.master.BaseScanner:
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>> All
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>> 1 .META. region(s) scanned
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:07,677 DEBUG
> > > > > > >> >> >> > >> org.apache.zookeeper.ClientCnxn:
> > > > > > >> >> >> > >> >>>> Got
> > > > > > >> >> >> > >> >>>> > > > ping
> > > > > > >> >> >> > >> >>>> > > > >>>> response for
> > > sessionid:0x424eebf1c10004c
> > > > > > after
> > > > > > >> 3ms
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:08,178 INFO
> > > > > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > > > > >> >> >> > >> >>>> > > > Exception
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>> in
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > > > > java.io.IOException:
> > > > > > >> Bad
> > > > > > >> >> >> connect
> > > > > > >> >> >> > >> ack
> > > > > > >> >> >> > >> >>>> with
> > > > > > >> >> >> > >> >>>> > > > >>>> firstBadLink 192.168.100.123:50010
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:08,178 INFO
> > > > > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > > > > >> >> >> > >> >>>> > > > >>>> Abandoning
> > > > > > >> >> >> > >> >>>> > > > >>>> block
> > blk_-2808245019291145247_5478039
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:09,682 INFO
> > > > > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > > > > >> >> >> > >> >>>> > > > Exception
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>> in
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > > > > java.io.EOFException
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:09,682 INFO
> > > > > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > > > > >> >> >> > >> >>>> > > > >>>> Abandoning
> > > > > > >> >> >> > >> >>>> > > > >>>> block
> blk_1074853606841896259_5478048
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:10,334 DEBUG
> > > > > > >> >> >> > >> org.apache.zookeeper.ClientCnxn:
> > > > > > >> >> >> > >> >>>> Got
> > > > > > >> >> >> > >> >>>> > > > ping
> > > > > > >> >> >> > >> >>>> > > > >>>> response for
> > sessionid:0x24eebf1043003c
> > > > > after
> > > > > > >> 1ms
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:21,018 DEBUG
> > > > > > >> >> >> > >> org.apache.zookeeper.ClientCnxn:
> > > > > > >> >> >> > >> >>>> Got
> > > > > > >> >> >> > >> >>>> > > > ping
> > > > > > >> >> >> > >> >>>> > > > >>>> response for
> > > sessionid:0x424eebf1c10004c
> > > > > > after
> > > > > > >> 0ms
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:23,674 DEBUG
> > > > > > >> >> >> > >> org.apache.zookeeper.ClientCnxn:
> > > > > > >> >> >> > >> >>>> Got
> > > > > > >> >> >> > >> >>>> > > > ping
> > > > > > >> >> >> > >> >>>> > > > >>>> response for
> > sessionid:0x24eebf1043003c
> > > > > after
> > > > > > >> 0ms
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> > > > > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > > > > >> >> >> > >> >>>> > > > Exception
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>> in
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > > > > java.io.IOException:
> > > > > > >> Bad
> > > > > > >> >> >> connect
> > > > > > >> >> >> > >> ack
> > > > > > >> >> >> > >> >>>> with
> > > > > > >> >> >> > >> >>>> > > > >>>> firstBadLink 192.168.100.123:50010
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> > > > > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > > > > >> >> >> > >> >>>> > > > >>>> Abandoning
> > > > > > >> >> >> > >> >>>> > > > >>>> block
> > blk_-6642544517082142289_5478063
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> > > > > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > > > > >> >> >> > >> >>>> > > > Exception
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>> in
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > > > > > >> java.net.SocketException:
> > > > > > >> >> Too
> > > > > > >> >> >> > many
> > > > > > >> >> >> > >> >>>> open
> > > > > > >> >> >> > >> >>>> > > files
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,828 INFO
> > > > > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > > > > >> >> >> > >> >>>> > > > >>>> Abandoning
> > > > > > >> >> >> > >> >>>> > > > >>>> block
> blk_2057511041109796090_5478063
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,928 INFO
> > > > > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > > > > >> >> >> > >> >>>> > > > Exception
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>> in
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > > > > > >> java.net.SocketException:
> > > > > > >> >> Too
> > > > > > >> >> >> > many
> > > > > > >> >> >> > >> >>>> open
> > > > > > >> >> >> > >> >>>> > > files
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:24,928 INFO
> > > > > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > > > > >> >> >> > >> >>>> > > > >>>> Abandoning
> > > > > > >> >> >> > >> >>>> > > > >>>> block
> blk_8219260302213892894_5478064
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,855 INFO
> > > > > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > > > > >> >> >> > >> >>>> > > > Exception
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>> in
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > > > > > >> java.net.SocketException:
> > > > > > >> >> Too
> > > > > > >> >> >> > many
> > > > > > >> >> >> > >> >>>> open
> > > > > > >> >> >> > >> >>>> > > files
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,855 INFO
> > > > > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > > > > >> >> >> > >> >>>> > > > >>>> Abandoning
> > > > > > >> >> >> > >> >>>> > > > >>>> block
> blk_1669205542853067709_5478235
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,905 INFO
> > > > > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > > > > >> >> >> > >> >>>> > > > Exception
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>> in
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > > > > > >> java.net.SocketException:
> > > > > > >> >> Too
> > > > > > >> >> >> > many
> > > > > > >> >> >> > >> >>>> open
> > > > > > >> >> >> > >> >>>> > > files
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,905 INFO
> > > > > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > > > > >> >> >> > >> >>>> > > > >>>> Abandoning
> > > > > > >> >> >> > >> >>>> > > > >>>> block
> blk_9128897691346270351_5478237
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,955 INFO
> > > > > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > > > > >> >> >> > >> >>>> > > > Exception
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>> in
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > > > > > >> java.net.SocketException:
> > > > > > >> >> Too
> > > > > > >> >> >> > many
> > > > > > >> >> >> > >> >>>> open
> > > > > > >> >> >> > >> >>>> > > files
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:30,955 INFO
> > > > > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > > > > >> >> >> > >> >>>> > > > >>>> Abandoning
> > > > > > >> >> >> > >> >>>> > > > >>>> block
> blk_1116845144864123018_5478240
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:34,372 DEBUG
> > > > > > >> >> >> > >> org.apache.zookeeper.ClientCnxn:
> > > > > > >> >> >> > >> >>>> Got
> > > > > > >> >> >> > >> >>>> > > > ping
> > > > > > >> >> >> > >> >>>> > > > >>>> response for
> > > sessionid:0x424eebf1c10004c
> > > > > > after
> > > > > > >> 0ms
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,034 DEBUG
> > > > > > >> >> >> > >> org.apache.zookeeper.ClientCnxn:
> > > > > > >> >> >> > >> >>>> Got
> > > > > > >> >> >> > >> >>>> > > > ping
> > > > > > >> >> >> > >> >>>> > > > >>>> response for
> > sessionid:0x24eebf1043003c
> > > > > after
> > > > > > >> 0ms
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN
> > > > > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>> DataStreamer
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>> Exception: java.io.IOException: Too
> > > many
> > > > > open
> > > > > > >> >> files
> > > > > > >> >> >> > >> >>>> > > > >>>>      at
> > > sun.nio.ch.IOUtil.initPipe(Native
> > > > > > >> Method)
> > > > > > >> >> >> > >> >>>> > > > >>>>      at
> > > > > > >> >> >> > >> >>>> > >
> > > > > > >> >> >>
> > > sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
> > > > > > >> >> >> > >> >>>> > > > >>>>      at
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > >
> > > > > > >> >> >> > >> >>>> > >
> > > > > > >> >> >> > >> >>>> >
> > > > > > >> >> >> > >> >>>>
> > > > > > >> >> >> > >>
> > > > > > >> >> >> >
> > > > > > >> >> >>
> > > > > > >> >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>>      at
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > >
> > > > > > >> >> >> > >> >>>> > >
> > > > > > >> >> >> > >> >>>> >
> > > > > > >> >> >> > >> >>>>
> > > > > > >> >> >> > >>
> > > > > > >> >> >> >
> > > > > > >> >> >>
> > > > > > >> >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:407)
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>>      at
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > >
> > > > > > >> >> >> > >> >>>> > >
> > > > > > >> >> >> > >> >>>> >
> > > > > > >> >> >> > >> >>>>
> > > > > > >> >> >> > >>
> > > > > > >> >> >> >
> > > > > > >> >> >>
> > > > > > >> >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:322)
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>>      at
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > >
> > > > > > >> >> >> > >> >>>> > >
> > > > > > >> >> >> > >> >>>> >
> > > > > > >> >> >> > >> >>>>
> > > > > > >> >> >> > >>
> > > > > > >> >> >> >
> > > > > > >> >> >>
> > > > > > >> >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>>      at
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > >
> > > > > > >> >> >> > >> >>>> > >
> > > > > > >> >> >> > >> >>>> >
> > > > > > >> >> >> > >> >>>>
> > > > > > >> >> >> > >>
> > > > > > >> >> >> >
> > > > > > >> >> >>
> > > > > > >> >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>>      at
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > >
> > > > > > >> >> >> > >> >>>> > >
> > > > > > >> >> >> > >> >>>> >
> > > > > > >> >> >> > >> >>>>
> > > > > > >> >> >> > >>
> > > > > > >> >> >> >
> > > > > > >> >> >>
> > > > > > >> >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>>      at
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>>
> > > > > > >> >> >>
> > > > > java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>>      at
> > > > > > >> >> >> > >> >>>>
> > > > > java.io.DataOutputStream.write(DataOutputStream.java:90)
> > > > > > >> >> >> > >> >>>> > > > >>>>      at
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > >
> > > > > > >> >> >> > >> >>>> > >
> > > > > > >> >> >> > >> >>>> >
> > > > > > >> >> >> > >> >>>>
> > > > > > >> >> >> > >>
> > > > > > >> >> >> >
> > > > > > >> >> >>
> > > > > > >> >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2290)
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN
> > > > > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > > > > >> >> >> > >> >>>> > Error
> > > > > > >> >> >> > >> >>>> > > > >>>> Recovery for block
> > > > > > >> blk_8148813491785406356_5478475
> > > > > > >> >> >> bad
> > > > > > >> >> >> > >> >>>> datanode[0]
> > > > > > >> >> >> > >> >>>> > > > >>>> 192.168.100.123:50010
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,235 WARN
> > > > > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > > > > >> >> >> > >> >>>> > Error
> > > > > > >> >> >> > >> >>>> > > > >>>> Recovery for block
> > > > > > >> blk_8148813491785406356_5478475
> > > > > > >> >> in
> > > > > > >> >> >> > >> pipeline
> > > > > > >> >> >> > >> >>>> > > > >>>> 192.168.100.123:50010,
> > > > > 192.168.100.134:50010
> > > > > > ,
> > > > > > >> >> >> > >> >>>> > 192.168.100.122:50010
> > > > > > >> >> >> > >> >>>> > > :
> > > > > > >> >> >> > >> >>>> > > > >>>> bad
> > > > > > >> >> >> > >> >>>> > > > >>>> datanode 192.168.100.123:50010
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,436 INFO
> > > > > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > > > > >> >> >> > >> >>>> > > > Exception
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>> in
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>> createBlockOutputStream
> > > > > > >> java.net.SocketException:
> > > > > > >> >> Too
> > > > > > >> >> >> > many
> > > > > > >> >> >> > >> >>>> open
> > > > > > >> >> >> > >> >>>> > > files
> > > > > > >> >> >> > >> >>>> > > > >>>> 2009-11-13 20:06:37,436 INFO
> > > > > > >> >> >> > >> org.apache.hadoop.hdfs.DFSClient:
> > > > > > >> >> >> > >> >>>> > > > >>>> Abandoning
> > > > > > >> >> >> > >> >>>> > > > >>>> block
> blk_2119727700857186236_5478498
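
The repeated "Too many open files" errors in the log above typically mean the process ran into its per-process file descriptor limit. A minimal sketch, assuming a Linux host with Python available, for inspecting that limit. It reports on the Python process itself, so it is only meaningful when run as the same user and with the same ulimit settings as the HBase/HDFS daemons. The function names are made up for illustration and are not part of Hadoop or HBase.

```python
import os
import resource


def fd_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count():
    """Count descriptors currently open by this process.

    Relies on /proc/self/fd, which is an assumption about the platform
    (Linux); returns None where that directory does not exist.
    """
    try:
        return len(os.listdir("/proc/self/fd"))
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    soft, hard = fd_limits()
    print("soft limit:", soft, "hard limit:", hard, "in use:", open_fd_count())
```

If the soft limit turns out to be a small default, raising it for the user that runs the daemons is the usual next step.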
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>> On Thu, Nov 12, 2009 at 4:21 PM, Zhenyu Zhong <zhongresearch@gmail.com> wrote:
> > > > > > >> >> >> > >> >>>> > > > >>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>  Will do.
> > > > > > >> >> >> > >> >>>> > > > >>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>> thanks
> > > > > > >> >> >> > >> >>>> > > > >>>>> zhenyu
> > > > > > >> >> >> > >> >>>> > > > >>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>> On Thu, Nov 12, 2009 at 3:33 PM, stack <stack@duboce.net> wrote:
> > > > > > >> >> >> > >> >>>> > > > >>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>> Enable GC logging too on this next run (see hbase-env.sh).
> > > > > > > >> >> >> > >> >>>> > > > >>>>>> Let's try and get to the bottom of what's going on.
> > > > > > > >> >> >> > >> >>>> > > > >>>>>> Thanks,
> > > > > > > >> >> >> > >> >>>> > > > >>>>>> St.Ack
> > > > > > >> >> >> > >> >>>> > > > >>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>> On Thu, Nov 12, 2009 at 12:29 PM, Zhenyu Zhong <zhongresearch@gmail.com> wrote:
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>> I can switch the boxes that run zookeeper with the ones that run
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>> regionservers.
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>> I will see the results later.
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>> FYI. The node does have the 10 minutes zookeeper.session.timeout
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>> value in place.
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>> thanks
> > > > > > >> >> >> > >> >>>> > > > >>>>>>> zhenyu
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>> On Thu, Nov 12, 2009 at 3:21 PM, stack <stack@duboce.net> wrote:
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>> On Thu, Nov 12, 2009 at 11:50 AM, Zhenyu Zhong <zhongresearch@gmail.com> wrote:
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>> In my cluster, half of the cluster has 2 disks of 400GB each per
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>> machine, and half of the cluster has 6 disks per machine. Maybe we
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>> should run zookeeper on the machines with 2 disks and RS on
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>> machines with 6 disks?
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>> That would make the most sense, only in the below it looks like
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>> the RS that had the issue had 4 disks?
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>> BTW, the 10 minutes zookeeper.session.timeout has been set during
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>> the experiment.
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>> And for sure this node had it in place?
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>> St.Ack
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>> thanks
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>> zhenyu
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>> On Thu, Nov 12, 2009 at 2:08 PM, stack <stack@duboce.net> wrote:
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> On Thu, Nov 12, 2009 at 8:40 AM, Zhenyu Zhong <zhongresearch@gmail.com> wrote:
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Though I experienced 2 regionserver disconnections this morning,
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> it looks better from the regionserver log. (please see the
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> following log)
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> http://pastebin.com/m496dbfae
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> I checked diskIO for one of the regionservers (192.168.100.116)
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> during the time it disconnected.
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Time: 03:04:58 AM
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Device:            tps   Blk_read/s   Blk_wrtn/s     Blk_read     Blk_wrtn
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> sda             105.31      5458.83     19503.64   9043873239  32312473676
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> sda1              2.90         6.64        99.25     10993934    164433464
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> sda2              1.72        23.76        30.25     39365817     50124008
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> sda3              0.30         0.38         3.58       624930      5923000
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> sda4            100.39      5428.06     19370.56   8992888270  32091993204
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> Is this high for you?  20k blocks/second would seem to be high
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> but it's one disk only and it's not being shared by zk anymore so
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> shouldn't matter?
> > > > > > >> >> >> > >> >>>> > > > >>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> I also checked the zookeeper quorum server that the regionserver
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> tried to connect to according to the log. However, I don't see
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> 192.168.100.116 in the client list of the zookeeper quorum server
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> that the regionserver tried to connect to.
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Would that be a problem?
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> Is that because the ephemeral node for the regionserver had
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> evaporated?  Lost its lease w/ zk by the time you went to look?
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Thu Nov 12 15:04:52 UTC 2009
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Zookeeper version: 3.2.1-808558, built on 08/27/2009 18:48 GMT
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Clients:
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.127:43045[1](queued=0,recved=26,sent=0)
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.131:39091[1](queued=0,recved=964,sent=0)
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.124:35961[1](queued=0,recved=3266,sent=0)
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.123:47935[1](queued=0,recved=233,sent=0)
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.125:46931[1](queued=0,recved=2,sent=0)
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.118:54924[1](queued=0,recved=295,sent=0)
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.118:41390[1](queued=0,recved=2290,sent=0)
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> /192.168.100.136:42243[1](queued=0,recved=0,sent=0)
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Latency min/avg/max: 0/17/6333
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Received: 47111
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Sent: 0
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Outstanding: 0
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Zxid: 0x77000083f4
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Mode: leader
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>> Node count: 23
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> That 6 second maximum latency is pretty bad but should be well
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> within the zk session timeout.
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> So, problem is likely on the zk client side of the session; i.e.
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> in the RS.
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> You could enable GC logging as J-D suggested to see if you have
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> any big pauses, pauses > zk session timeout.
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> When the RS went down, it didn't look too heavily loaded:
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>  1. 2009-11-12 15:04:52,830
> > INFO
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> >  org.apache.hadoop.hbase.regionserver.HRegionServer:
> > > > > > >> >> >> > >> >>>> Dump of
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>> metrics:
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>  request=1.5166667,
> regions=322,
> > > > > > >> stores=657,
> > > > > > >> >> >> > >> >>>> storefiles=631,
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>  storefileIndexSize=61,
> > > > > > >> memstoreSize=1472,
> > > > > > >> >> >> > >> >>>> usedHeap=2819,
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>> maxHeap=4079,
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>  blockCacheSize=658110960,
> > > > > > >> >> >> > blockCacheFree=197395984,
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>> blockCacheCount=9903,
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>  blockCacheHitRatio=99
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> It's serving a few reads?  The number of store files seems fine.
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> Not too much memory used.
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> Looking at the logs, I see the Lease Still Held exception.  This
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> happens when the RS does its regular report to the master but the
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> master thinks the RS has since restarted.  It'll think this
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> probably because it noticed that the RS's znode in zk had gone
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> away and it considered the RS dead.
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> Looking too at your logs I see this gap in the zk pinging:
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>  1. 2009-11-12 15:03:39,325
> > DEBUG
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>
> org.apache.zookeeper.ClientCnxn:
> > > > > > >> >> >> > >> >>>> > > > >>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>> Got
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>  ping response for
> > > > > > >> sessionid:0x224e55436ad0004
> > > > > > >> >> >> after
> > > > > > >> >> >> > >> 0ms
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>  2. 2009-11-12 15:03:43,113
> > DEBUG
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>
> org.apache.zookeeper.ClientCnxn:
> > > > > > >> >> >> > >> >>>> > > > >>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>> Got
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>  ping response for
> > > > > > >> sessionid:0x24e55436a0007d
> > > > > > >> >> >> after
> > > > > > >> >> >> > >> 0ms
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> Where in the lines above it, it's reporting about every ten
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> seconds, here there is a big gap.
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> Do you have ganglia or something that will let you look more
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> into what was happening on this machine around this time?  Is the
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> machine OK?  It looks lightly loaded and you have your cluster
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> nicely laid out.  Something odd is going on.  What about things
> > > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> like the write speed to disk?  In the past strange issues have been
> > explained
> > > by
> > > > > > >> >> incorrectly
> > > > > > >> >> >> > set
> > > > > > >> >> >> > >> BIOS
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>> which
> > > > > > >> >> >> > >> >>>> > > > >>>
> > > > > > >> >> >> > >> >>>> > > > >>>> made
> > > > > > >> >> >> > >> >>>> > > > >>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>> disks
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> run at 1/100th of their
> proper
> > > > speed.
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>> St.Ack
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > > > > > >> >> >> > >> >>>> > > > >>>>>>>>>>
> > Best,
> > zhenyu
> >
> > On Wed, Nov 11, 2009 at 3:58 PM, Zhenyu Zhong <zhongresearch@gmail.com> wrote:
> >
> > > Stack
> > >
> > > I very much appreciate your comments.
> > > I will use the zookeeper monitoring script on my cluster and let it run
> > > overnight to see the result.
> > > I will follow up on the thread when I get anything.
> > >
> > > thanks
> > > zhenyu
> > >
> > > On Wed, Nov 11, 2009 at 3:52 PM, stack <stack@duboce.net> wrote:
> > >
> > > > I see these in your log too:
> > > >
> > > >  1. 2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424dfd908c50009 after 4544ms
> > > >  2. 2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x24dfd90c810002 after 4548ms
> > > >  3. 2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.ClientCnxn: Got ping response for sessionid:0x424dfd908c50009 after 9030ms
> > > >  4. 2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.
>
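For anyone wanting to scan a regionserver log for the same symptoms discussed in the quoted messages (slow ping responses and long silences between pings), here is a rough sketch in Python. It is not part of HBase or ZooKeeper; the names PING_RE, parse_pings and find_problems are made up for illustration, and the thresholds (1000 ms latency, 15 s gap) are assumptions chosen only because the thread mentions pings arriving roughly every ten seconds. It only understands the ClientCnxn DEBUG line format quoted above.

```python
import re
from datetime import datetime

# Matches the ClientCnxn DEBUG ping lines quoted in this thread.
PING_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) DEBUG "
    r"org\.apache\.zookeeper\.ClientCnxn: Got ping response for "
    r"sessionid:(?P<sid>0x[0-9a-fA-F]+) after (?P<ms>\d+)ms"
)


def parse_pings(lines):
    """Yield (timestamp, session_id, latency_ms) for each ping-response line."""
    for line in lines:
        m = PING_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            yield ts, m.group("sid"), int(m.group("ms"))


def find_problems(lines, max_latency_ms=1000, max_gap_s=15.0):
    """Return (slow, gaps).

    slow: pings whose reported latency exceeds max_latency_ms.
    gaps: pairs of consecutive pings on the same session that are more
          than max_gap_s seconds apart.
    Both thresholds are illustrative assumptions, not ZooKeeper defaults.
    """
    slow, gaps = [], []
    last_seen = {}  # session id -> timestamp of the previous ping
    for ts, sid, ms in parse_pings(lines):
        if ms > max_latency_ms:
            slow.append((ts, sid, ms))
        prev = last_seen.get(sid)
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap > max_gap_s:
                gaps.append((prev, ts, sid, gap))
        last_seen[sid] = ts
    return slow, gaps


# The three complete log lines quoted in the thread.
SAMPLE = [
    "2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn: "
    "Got ping response for sessionid:0x424dfd908c50009 after 4544ms",
    "2009-11-11 04:27:19,018 DEBUG org.apache.zookeeper.ClientCnxn: "
    "Got ping response for sessionid:0x24dfd90c810002 after 4548ms",
    "2009-11-11 04:27:43,960 DEBUG org.apache.zookeeper.ClientCnxn: "
    "Got ping response for sessionid:0x424dfd908c50009 after 9030ms",
]

if __name__ == "__main__":
    slow, gaps = find_problems(SAMPLE)
    for ts, sid, ms in slow:
        print(f"slow ping: {ts} {sid} {ms}ms")
    for prev, ts, sid, gap in gaps:
        print(f"gap: {sid} silent for {gap:.3f}s between {prev} and {ts}")
```

To run it against a real log, pass an open file object instead of SAMPLE, e.g. find_problems(open(path)), where path is wherever your regionserver log lives. Gaps are tracked per session id so that interleaved sessions in one log do not hide each other's silences.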
