hbase-user mailing list archives

From "Yannis Pavlidis" <ypavli...@oneriot.com>
Subject unable to access META region after a region server FATAL crash
Date Wed, 21 Oct 2009 00:00:15 GMT

Hi all,

I have encountered a very strange race condition during my testing which leaves the META
region inaccessible: it remains assigned to a region server which has been shut down
(it encountered a FATAL error).

Here is the scenario (using hadoop-0.20.1 and hbase-0.20.0 on a 3-node cluster):

pre condition
===============
cache01 (is the backup master, runs a region server that has the root and meta assigned to it)

cache02 (runs a region server)
search01 (runs the master and the region server)

scenario
=========
kill the master on search01

the master on cache01 resumes master duties

cache01 encounters a fatal error (FATAL org.apache.hadoop.hbase.regionserver.LogRoller: Log
rolling failed with ioe) and has to exit

The root is getting re-assigned to the region server on search01 and the meta is getting re-assigned
to the region server on cache02.

Now cache02 encounters the same fatal error (FATAL org.apache.hadoop.hbase.regionserver.LogRoller:
Log rolling failed with ioe) and has to exit before it accepts the assignment for servicing
the meta region

post condition
===============

While the root is assigned to search01, the meta appears to have been left in limbo (I
think it is still in the regionsInTransition map of the RegionManager). I believe the cause
is a race condition.
The region server on cache02 never gets the chance to complete the assignment of the meta
region. When the master on cache01 realizes in ProcessServerShutdown that cache02 has died,
it never checks whether the dead server had a meta region assigned to it in transition
(the isMetaServer method in the RegionManager checks for that). As a result, when
my client connects it gets the cache02 address for the meta server and of course keeps
failing to connect.

To address this race condition, I believe we simply have to check in closeMetaRegions whether
the deadServer isMetaServer and, if it is, add the MetaRegion to the list (I had to create a
new method in the RegionManager to return the RegionInfo of the MetaRegion).
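
The proposed check can be illustrated with a small toy model. This is only a sketch under
stated assumptions, not the actual HBase 0.20 code: the class MetaReassignSketch, the two maps,
the checkInTransition flag, and the method signatures are hypothetical stand-ins that borrow
the names used above (RegionManager, isMetaServer, closeMetaRegions). Regions and servers are
plain strings here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the race described above. NOT the real HBase classes. */
public class MetaReassignSketch {

    /** Hypothetical, heavily simplified stand-in for the master's RegionManager. */
    public static class RegionManager {
        // META regions confirmed open: region name -> serving server.
        public final Map<String, String> onlineMetaRegions = new HashMap<>();
        // META regions assigned but not yet confirmed open: region name -> target server.
        public final Map<String, String> regionsInTransition = new HashMap<>();

        /** Assumed behavior: true if the server serves META or has META pending in transition. */
        public boolean isMetaServer(String server) {
            return onlineMetaRegions.containsValue(server)
                || regionsInTransition.containsValue(server);
        }

        /**
         * Collects the META regions that must be reassigned after deadServer dies.
         * With checkInTransition == false only confirmed-open regions are considered,
         * which models the behavior that leaves META in limbo.
         */
        public List<String> closeMetaRegions(String deadServer, boolean checkInTransition) {
            List<String> toReassign = new ArrayList<>();
            drain(onlineMetaRegions, deadServer, toReassign);
            if (checkInTransition && isMetaServer(deadServer)) {
                // The proposed extra step: also pick up a META region whose
                // assignment to the dead server never completed.
                drain(regionsInTransition, deadServer, toReassign);
            }
            return toReassign;
        }

        /** Moves every region mapped to deadServer out of the map and into the list. */
        private static void drain(Map<String, String> map, String deadServer, List<String> out) {
            map.entrySet().removeIf(e -> {
                if (e.getValue().equals(deadServer)) {
                    out.add(e.getKey());
                    return true;
                }
                return false;
            });
        }
    }

    public static void main(String[] args) {
        RegionManager rm = new RegionManager();
        // cache02 was handed META but died before confirming the open.
        rm.regionsInTransition.put(".META.", "cache02");

        System.out.println("without check: " + rm.closeMetaRegions("cache02", false));
        System.out.println("with check:    " + rm.closeMetaRegions("cache02", true));
    }
}
```

In this model the first call returns an empty list, so nothing would be reassigned and clients
would keep being pointed at cache02. The second call returns the pending META region, which
the shutdown handling could then hand to a live region server.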

I have not been able to verify my fix, though, since I have been unable to reproduce the above
scenario.

Let me know what you guys think. I have attached links to the logs at the end.

Also, I would appreciate it if you could tell me what could have caused the fatal error on the
region servers (I am sure it is related to my killing the master nodes).

Thanks in advance,

=======
master logs on cache01: http://pastebin.com/m61f4893d
regionserver logs on cache01: http://pastebin.com/m56e4302b
regionserver logs on cache02: http://pastebin.com/m11fac0e6
regionserver logs on search01: http://pastebin.com/d667f876c
(For the FATAL errors)
namenode on cache01: http://pastebin.com/dc020387
datanode on cache01: http://pastebin.com/ma25decd

Yannis.

--
Search for the Pulse

Yannis Pavlidis | OneRiot
Softwarist
talk: 720.771.7025
write: ypavlidis@oneriot.com
web: www.oneriot.com


