hbase-user mailing list archives

From Hayden Marchant <hayd...@amobee.com>
Subject Orphaned aborted snapshot
Date Thu, 23 Oct 2014 05:57:49 GMT
Hi all,


I am running HBase 0.94.6 on a 20-node cluster and am taking daily snapshots of our single
table (only keeping snapshots for the last 3 days). Yesterday, I started seeing the following
messages in the log of one of the region servers that had to be restarted:


2014-10-22 08:29:19,982 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server handler 1 on 60020: starting
2014-10-22 08:29:19,982 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server handler 2 on 60020: starting
2014-10-22 08:29:19,986 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Serving as njhdslave40,60020,1413980958234, RPC listening on njhdslave40/172.30.120.180:60020, sessionid=0x2482ca09984b22a
2014-10-22 08:29:19,986 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker njhdslave40,60020,1413980958234 starting
2014-10-22 08:29:19,988 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Registered RegionServer MXBean
2014-10-22 08:29:20,024 INFO org.apache.hadoop.hbase.procedure.ProcedureMember: Received abort on procedure with no local subprocedure upd-2014_10_19, ignoring it.
org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via timer-java.util.Timer@2cef133c:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1413728408736, End:1413728468758, diff:60022, max:60000 ms
    at org.apache.hadoop.hbase.errorhandling.ForeignException.deserialize(ForeignException.java:171)
    at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.abort(ZKProcedureMemberRpcs.java:320)
    at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.watchForAbortedProcedures(ZKProcedureMemberRpcs.java:143)
    at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.start(ZKProcedureMemberRpcs.java:340)
    at org.apache.hadoop.hbase.regionserver.snapshot.RegionServerSnapshotManager.start(RegionServerSnapshotManager.java:141)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:734)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: org.apache.hadoop.hbase.errorhandling.TimeoutException: Timeout elapsed! Source:Timeout caused Foreign Exception Start:1413728408736, End:1413728468758, diff:60022, max:60000 ms
    at org.apache.hadoop.hbase.errorhandling.TimeoutExceptionInjector$1.run(TimeoutExceptionInjector.java:71)
    at java.util.TimerThread.mainLoop(Timer.java:512)
    at java.util.TimerThread.run(Timer.java:462)
2014-10-22 08:29:29,526 WARN org.apache.hadoop.conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available




After looking at the code, I see that on startup the RegionServerSnapshotManager watches for
aborted procedure nodes in ZooKeeper and logs the exception above when it finds one. A few days
ago we had some issues with one of the servers, and I guess the creation of the daily snapshot
was aborted then. Indeed, looking in ZooKeeper, we see a record of an aborted snapshot from
19 October.
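
As a cross-check on the date, the Start/End values in the timeout message can be decoded.
This is only a rough sketch: it assumes those values are epoch milliseconds, and the helper
names (parse_timeout, to_utc) are my own, not anything from HBase.

```python
import re
from datetime import datetime, timezone

# Timeout detail copied from the ForeignException message in the log.
MESSAGE = "Start:1413728408736, End:1413728468758, diff:60022, max:60000 ms"


def parse_timeout(message):
    """Return the Start/End/diff/max fields as ints (assumed to be milliseconds)."""
    fields = dict(re.findall(r"(Start|End|diff|max):(\d+)", message))
    return {name: int(value) for name, value in fields.items()}


def to_utc(epoch_ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


info = parse_timeout(MESSAGE)
started = to_utc(info["Start"])          # when the timed-out procedure began
elapsed = info["End"] - info["Start"]    # should match the reported diff

print(started.date().isoformat())
print(elapsed, elapsed > info["max"])
```

Under that assumption the start time falls on 2014-10-19 (UTC), which matches the snapshot
name upd-2014_10_19, and the elapsed time is just over the 60000 ms limit.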



Here is a listing of the abort node in ZooKeeper:

[zk: slave:2181 (CONNECTED) 2] ls /hbase/online-snapshot/abort
[upd-2014_10_19]

Just to confirm, I restarted another region server, and saw the same error. It seems that
the cluster is working correctly, and new snapshots are being created. 


My questions are: are these error messages expected, what process is responsible for automatically
cleaning up the 'abort' node, and are there any orphaned HLogs from the aborted snapshot that
need to be cleaned up manually?

Thanks,
Hayden



