hbase-user mailing list archives

From Arun Ramakrishnan <aramakrish...@languageweaver.com>
Subject RE: region servers crashing
Date Wed, 14 Jul 2010 22:33:22 GMT
We had a problem that produced symptoms like this:

> 2010-07-12 15:10:03,299 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
> 86246ms, ten times longer than scheduled: 1000

Our problem was clock skew. We just had to make sure NTP was running on all machines
and that every machine was set to the same timezone.
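
To illustrate why clock skew can produce the same warning as a real pause, here is a minimal sketch. It assumes the check compares wall-clock timestamps taken before and after a sleep; that is an assumption for illustration, not HBase's actual code, and every name below (slept_too_long, measure_sleep, SteppedClock) is hypothetical.

```python
import time


def slept_too_long(scheduled_ms, start_ms, end_ms, factor=10):
    """Return True when the measured sleep exceeds factor x the scheduled period."""
    return (end_ms - start_ms) > factor * scheduled_ms


def measure_sleep(scheduled_ms, clock=time.time, sleep=time.sleep):
    """Sleep for scheduled_ms and return the elapsed ms according to `clock`."""
    start = clock()
    sleep(scheduled_ms / 1000.0)
    return int(round((clock() - start) * 1000))


class SteppedClock:
    """Fake wall clock that jumps forward by step_s during every sleep (simulated skew fix)."""

    def __init__(self, step_s):
        self.now = 0.0
        self.step_s = step_s

    def time(self):
        return self.now

    def sleep(self, seconds):
        # The process only sleeps `seconds`, but the wall clock is also stepped.
        self.now += seconds + self.step_s


# A 1000 ms sleep during which the clock is stepped forward by 85 s.
fake = SteppedClock(step_s=85.0)
elapsed_ms = measure_sleep(1000, clock=fake.time, sleep=fake.sleep)
print(elapsed_ms, slept_too_long(1000, 0, elapsed_ms))
```

Under this assumption the process never stalled, yet the measured sleep is far beyond ten times the scheduled period. Measuring with a monotonic clock (time.monotonic in Python) would not be affected by such a step.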

-----Original Message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
Sent: Wednesday, July 14, 2010 3:11 PM
To: user@hbase.apache.org
Subject: Re: region servers crashing


Your log shows this:

> 2010-07-12 15:10:03,299 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
> 86246ms, ten times longer than scheduled: 1000

This is a pause that lasted more than a minute. The process was
stalled (GC, swapping, or a mix of both) for some reason, and the
stall was long enough to expire the ZooKeeper session (since from
ZooKeeper's point of view the region server stopped responding).

The NPE is just a side effect of the huge pause.

It's well worth upgrading, but it won't solve your pausing issues. I
can only recommend closer monitoring, setting swappiness to 0 and
giving more memory to HBase (if available).
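
As one way to start on closer monitoring, here is a minimal sketch that scans region server log text for the Sleeper warning shown above and reports pauses longer than a given session timeout. The function name find_long_pauses is hypothetical, and the 60000 ms threshold is an assumed value for illustration only; use whatever ZooKeeper session timeout your cluster is actually configured with.

```python
import re

# Matches the Sleeper warning format seen in the log excerpt above.
# \s+ tolerates the line wrapping that mail clients introduce.
SLEEPER_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) WARN "
    r"org\.apache\.hadoop\.hbase\.util\.Sleeper: We slept\s+(?P<slept>\d+)ms",
    re.MULTILINE,
)


def find_long_pauses(log_text, session_timeout_ms):
    """Return (timestamp, slept_ms) for each Sleeper warning over the timeout."""
    pauses = []
    for match in SLEEPER_RE.finditer(log_text):
        slept_ms = int(match.group("slept"))
        if slept_ms > session_timeout_ms:
            pauses.append((match.group("ts"), slept_ms))
    return pauses


sample = (
    "2010-07-12 15:10:03,299 WARN org.apache.hadoop.hbase.util.Sleeper: "
    "We slept 86246ms, ten times longer than scheduled: 1000\n"
)
# 60000 ms is an assumed threshold, not a value read from any configuration.
for ts, slept_ms in find_long_pauses(sample, session_timeout_ms=60000):
    print(f"{ts}: paused {slept_ms} ms, longer than the assumed session timeout")
```

Correlating these timestamps with GC logs and swap activity on the same host should help narrow down what the process was doing during the pause.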


On Wed, Jul 14, 2010 at 3:03 PM, Dmitry Chechik <dmitry@tellapart.com> wrote:
> Hi all,
> We've been having issues for a few days with HBase region servers crashing
> when under load from mapreduce jobs.
> There are a few different errors in the region server logs - I've attached a
> sample log of 4 different region servers crashing within an hour of each
> other.
> Some details:
> - This happens when a full table scan from a mapreduce is in progress.
> - We are running HBase 0.20.3, with a 16-slave cluster, on EC2.
> - Some of the region server errors are NPEs which look a lot
> like https://issues.apache.org/jira/browse/HBASE-2077. I'm not sure if that
> is the exact problem or if this issue is fixed in 0.20.5. Is it worth
> upgrading to 0.20.5 to fix this?
> - Some of the region server errors are scanner lease expired errors:
> 2010-07-12 15:10:03,299 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 86246ms, ten times longer than scheduled: 1000
> 2010-07-12 15:10:03,299 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x229c72b89360001 to sun.nio.ch.SelectionKeyImpl@7f712b3a
> java.io.IOException: TIMED OUT
>         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> 2010-07-12 15:10:03,299 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 1779060682963568676 lease expired
> 2010-07-12 15:10:03,406 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
> org.apache.hadoop.hbase.UnknownScannerException: Name: 1779060682963568676
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1877)
>         at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
> We tried increasing hbase.regionserver.lease.period to 2 minutes but that
> didn't seem to make a difference here.
> - Our configuration and table size haven't changed significantly in those
> days.
> - We're running a 3-node Zookeeper cluster collocated on the same machines
> as the HBase/Hadoop cluster.
> - Based on Ganglia output, it doesn't look like the regionservers (or any of
> the machines) are swapping.
> - At the time of the crash, it doesn't appear that the network was
> overloaded (i.e. we've seen higher network traffic without crashes). So it
> doesn't seem that this is a problem communicating with Zookeeper.
> - We have "-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode" enabled, so it
> doesn't seem like we should be pausing due to GC too much.
> Any thoughts?
> Thanks,
> - Dmitry
