hbase-user mailing list archives

From Dmitry Chechik <dmi...@tellapart.com>
Subject region servers crashing
Date Wed, 14 Jul 2010 22:03:52 GMT
Hi all,

We've been having issues for a few days with HBase region servers crashing
when under load from mapreduce jobs.

There are a few different errors in the region server logs; I've attached a
sample log covering four region servers that crashed within an hour of each
other.

Some details:
- This happens while a full table scan from a MapReduce job is in progress.
- We are running HBase 0.20.3 on a 16-slave cluster on EC2.
- Some of the region server errors are NPEs that look a lot like
https://issues.apache.org/jira/browse/HBASE-2077. I'm not sure whether that is
the exact problem, or whether it is fixed in 0.20.5. Is it worth upgrading to
0.20.5 to fix this?
- Some of the region server errors are scanner lease expired errors:
2010-07-12 15:10:03,299 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 86246ms, ten times longer than scheduled: 1000
2010-07-12 15:10:03,299 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x229c72b89360001 to sun.nio.ch.SelectionKeyImpl@7f712b3a
java.io.IOException: TIMED OUT
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-07-12 15:10:03,299 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 1779060682963568676 lease expired
2010-07-12 15:10:03,406 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
org.apache.hadoop.hbase.UnknownScannerException: Name: 1779060682963568676
        at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1877)
        at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)

We tried increasing hbase.regionserver.lease.period to 2 minutes but that
didn't seem to make a difference here.
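
For reference, this is roughly how that looks in hbase-site.xml (a minimal
sketch; 120000 ms = 2 minutes, versus the 60000 ms default):

  <property>
    <name>hbase.regionserver.lease.period</name>
    <!-- scanner lease timeout in milliseconds -->
    <value>120000</value>
  </property>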

- Our configuration and table size haven't changed significantly over those
few days.
- We're running a 3-node ZooKeeper cluster co-located on the same machines as
the HBase/Hadoop cluster.
- Based on Ganglia output, it doesn't look like the regionservers (or any of
the machines) are swapping.
- At the time of the crashes, the network doesn't appear to have been
overloaded (we've seen higher network traffic without crashes), so this
doesn't seem to be a problem communicating with ZooKeeper.
- We have "-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode" enabled (see the
hbase-env.sh sketch below), so it doesn't seem like we should be pausing too
much for GC.
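
For completeness, the GC flags are passed via hbase-env.sh, roughly like this
(a minimal sketch; heap size is set separately and omitted here):

  # hbase-env.sh: extra JVM options for the HBase daemons
  export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"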

Any thoughts?

Thanks,

- Dmitry
