hbase-user mailing list archives

From Dmitry Chechik <dmi...@tellapart.com>
Subject Re: region servers crashing
Date Wed, 14 Jul 2010 23:39:12 GMT
We're running with 1GB of heap space.

Thanks all - we'll look into GC tuning some more.
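
For the archives, here's roughly the hbase-env.sh change we plan to experiment with first -- values are illustrative, not settled, and the GC log path is just an example:

  # conf/hbase-env.sh (illustrative values, not our final settings)
  export HBASE_HEAPSIZE=4000        # MB; up from the 1GB we run today
  export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
      -Xloggc:/var/log/hbase/gc-hbase.log"
  # Possible further CMS knobs (more useful without incremental mode):
  #   -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly

At minimum the GC logging should tell us whether the long pauses are GC or something else (swap, IO).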

On Wed, Jul 14, 2010 at 3:47 PM, Jonathan Gray <jgray@facebook.com> wrote:

> This doesn't look like a clock skew issue.
>
> @Dmitry, while you should be running CMS, it is still a garbage collector
> and is still vulnerable to GC pauses.  There are additional configuration
> parameters you can tune further.
>
> How much heap are you running with on your RSs?  If you are hitting your
> servers with lots of load, you should run with 4GB or more.
>
> Also, having ZK on the same servers as RS/DN is going to create problems if
> you're already hitting your IO limits.
>
> JG
>
> > -----Original Message-----
> > From: Arun Ramakrishnan [mailto:aramakrishnan@languageweaver.com]
> > Sent: Wednesday, July 14, 2010 3:33 PM
> > To: user@hbase.apache.org
> > Subject: RE: region servers crashing
> >
> > We had a problem that caused symptoms that looked like this.
> >
> > > 2010-07-12 15:10:03,299 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 86246ms, ten times longer than scheduled: 1000
> >
> > Our problem was clock skew. We just had to make sure ntp was running on
> > all machines and that every machine detected the same timezone.
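> >
> > In case it helps, the checks we ran on each node were roughly these (the
> > timezone file below is the Debian/Ubuntu location):
> >
> >   ntpq -p             # peer list; offsets should be a few milliseconds at most
> >   date; date -u       # compare local time and UTC across machines
> >   cat /etc/timezone   # every node should report the same zone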
> >
> > -----Original Message-----
> > From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
> > Sent: Wednesday, July 14, 2010 3:11 PM
> > To: user@hbase.apache.org
> > Subject: Re: region servers crashing
> >
> > Dmitry,
> >
> > Your log shows this:
> >
> > > 2010-07-12 15:10:03,299 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 86246ms, ten times longer than scheduled: 1000
> >
> > This is a pause that lasted more than a minute; the process was stuck in
> > that state (GC, swapping, or a mix of the two) for some reason, and it
> > was long enough to expire the ZooKeeper session (since, from ZooKeeper's
> > point of view, the region server had stopped responding).
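> >
> > A quick way to see how often that is happening is to grep the region
> > server logs for those Sleeper warnings (log path and file name will vary):
> >
> >   grep "We slept" /var/log/hbase/hbase-*-regionserver-*.log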
> >
> > The NPE is just a side effect; it is caused by the huge pause.
> >
> > It's well worth upgrading, but it won't solve your pausing issues. I
> > can only recommend closer monitoring, setting swappiness to 0 and
> > giving more memory to HBase (if available).
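> >
> > For swappiness, something like this on each region server box (needs
> > root; the second line makes it persist across reboots):
> >
> >   sysctl -w vm.swappiness=0
> >   echo 'vm.swappiness = 0' >> /etc/sysctl.conf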
> >
> > J-D
> >
> > On Wed, Jul 14, 2010 at 3:03 PM, Dmitry Chechik <dmitry@tellapart.com> wrote:
> > > Hi all,
> > > We've been having issues for a few days with HBase region servers crashing when under load from mapreduce jobs.
> > > There are a few different errors in the region server logs - I've attached a sample log of 4 different region servers crashing within an hour of each other.
> > > Some details:
> > > - This happens when a full table scan from a mapreduce is in progress.
> > > - We are running HBase 0.20.3, with a 16-slave cluster, on EC2.
> > > - Some of the region server errors are NPEs which look a lot like https://issues.apache.org/jira/browse/HBASE-2077. I'm not sure if that is the exact problem or if this issue is fixed in 0.20.5. Is it worth upgrading to 0.20.5 to fix this?
> > > - Some of the region server errors are scanner lease expired errors:
> > > 2010-07-12 15:10:03,299 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 86246ms, ten times longer than scheduled: 1000
> > > 2010-07-12 15:10:03,299 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x229c72b89360001 to sun.nio.ch.SelectionKeyImpl@7f712b3a
> > > java.io.IOException: TIMED OUT
> > >         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> > > 2010-07-12 15:10:03,299 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 1779060682963568676 lease expired
> > > 2010-07-12 15:10:03,406 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: org.apache.hadoop.hbase.UnknownScannerException: Name: 1779060682963568676
> > >         at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1877)
> > >         at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >         at java.lang.reflect.Method.invoke(Method.java:597)
> > >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
> > > We tried increasing hbase.regionserver.lease.period to 2 minutes but that didn't seem to make a difference here.
> > > - Our configuration and table size haven't changed significantly in those days.
> > > - We're running a 3-node Zookeeper cluster collocated on the same machines as the HBase/Hadoop cluster.
> > > - Based on Ganglia output, it doesn't look like the regionservers (or any of the machines) are swapping.
> > > - At the time of the crash, it doesn't appear that the network was overloaded (i.e. we've seen higher network traffic without crashes). So it doesn't seem that this is a problem communicating with Zookeeper.
> > > - We have "-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode" enabled, so it doesn't seem like we should be pausing due to GC too much.
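> > > For reference, the lease bump mentioned above was roughly this fragment in hbase-site.xml on the region servers (value in milliseconds):
> > >   <property>
> > >     <name>hbase.regionserver.lease.period</name>
> > >     <value>120000</value>
> > >   </property>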
> > > Any thoughts?
> > > Thanks,
> > > - Dmitry
>
