hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: Errors after major compaction
Date Thu, 07 Jul 2011 05:52:20 GMT
On Sun, Jul 3, 2011 at 12:02 PM, Eran Kutner <eran@gigya.com> wrote:
> 4. Then at 16:40:00 the master log says: master:60000-0x13004a31d7804c4
> Creating (or updating) unassigned node for 584dac5cc70d8682f71c4675a843c309
> with OFFLINE state - why did it decide to take the region offline after
> learning it was successfully opened?


My guess is that though we'd opened the region, the timeout of regions
in transition expired and we queued assigning it elsewhere (the
first step in assigning a region elsewhere is putting the region's
znode into the OFFLINE state).  Mind pastebin'ing this part of the
master log?

The issues Ted cites, and the "fix raciness" issue I added to it, are
about cutting down the span over which locks are held in the master --
this has made for big improvements in the promptness with which the
master processes state transitions.  Then there are races between
the handling of region transitions (e.g. opens) down in the region
transition handlers and the running of the timeout monitor.  These are
what's being addressed.
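
To illustrate the kind of race described above, here is a minimal sketch
in Java.  It is a toy model, not the master's assignment code: the class
TransitionSketch, its State enum, and all method names are hypothetical.
It only shows the general idea of having the timeout monitor re-check the
region's state under the same lock the open handler uses, so that an open
that has already been processed is not forced back to OFFLINE.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of regions in transition; not HBase's actual code.
class TransitionSketch {
    enum State { PENDING_OPEN, OPENED, OFFLINE }

    private final Map<String, State> inTransition = new HashMap<>();
    private final Map<String, Long> startedAt = new HashMap<>();

    // The master asks a regionserver to open the region.
    synchronized void startOpen(String region, long nowMs) {
        inTransition.put(region, State.PENDING_OPEN);
        startedAt.put(region, nowMs);
    }

    // Open handler: the regionserver reported a successful open.
    synchronized void handleOpened(String region) {
        inTransition.put(region, State.OPENED);
    }

    // Timeout monitor: only force OFFLINE if the region is still pending
    // and the timeout has really elapsed. Returns true if it forced OFFLINE.
    synchronized boolean timeOut(String region, long nowMs, long timeoutMs) {
        State state = inTransition.get(region);
        Long start = startedAt.get(region);
        if (state != State.PENDING_OPEN || start == null
                || nowMs - start < timeoutMs) {
            return false;
        }
        inTransition.put(region, State.OFFLINE);
        return true;
    }

    synchronized State stateOf(String region) {
        return inTransition.get(region);
    }
}
```

Because both paths synchronize on the same object and the monitor checks
the state before acting, whichever of the two runs first determines the
outcome, and the monitor cannot undo a completed open.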

> 5. Then it tries to reopen the region on hadoop1-s05, which indicates in its
> log that the open request failed because the region was already open - why
> didn't the master use that information to learn that the region was already
> open?

It looks like we log it as WARN on the regionserver side but do
nothing else with it.  Here is the message:

2011-06-29 16:40:01,079 WARN
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler:
Attempted open of
gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
but already online on this server

We notice we already have it open down in the open region handler
in the regionserver.  We've let go of the connection to the
master at this stage, so we have no way of flagging to the master that
we already have this region.  What we should do is, before we queue it,
check if we already have it and return the master an
AlreadyOpenException.  (I made HBASE-4073 to make sure we don't forget
about this one -- the root issue needs addressing but thereafter, we
should never queue the opening of a region we already have opened on
the regionserver.)
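
The check-before-queue idea could look roughly like the sketch below.
This is an assumption-laden illustration in Java, not the regionserver
code and not what HBASE-4073 ended up containing: OpenRequestSketch, its
fields and methods, and the shape of AlreadyOpenException here are all
made up for the example.  The point is only that the check happens while
the master's request is still being answered, so the exception can travel
back to the master instead of ending up as a WARN in the local log.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical exception type for this sketch only.
class AlreadyOpenException extends Exception {
    AlreadyOpenException(String regionName) {
        super("Region already online on this server: " + regionName);
    }
}

// Hypothetical stand-in for the regionserver side of an open request.
class OpenRequestSketch {
    // Encoded names of the regions this server is currently serving.
    private final Set<String> onlineRegions = ConcurrentHashMap.newKeySet();
    // Stand-in for the queue that feeds the open region handlers.
    private final List<String> openQueue = new ArrayList<>();

    void markOnline(String encodedName) {
        onlineRegions.add(encodedName);
    }

    // Runs while the master is still waiting on the call, so a thrown
    // exception reaches the master rather than being logged and dropped.
    synchronized void openRegion(String encodedName)
            throws AlreadyOpenException {
        if (onlineRegions.contains(encodedName)) {
            throw new AlreadyOpenException(encodedName);
        }
        openQueue.add(encodedName); // only queue regions we do not have
    }

    synchronized int queuedOpens() {
        return openQueue.size();
    }
}
```

With this shape, a second open request for a region the server already
serves is rejected up front and nothing is queued.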


> 7. Now the master forces the transition of the region to hadoop1-s02 but
> there is no sign of that on hadoop1-s05 - why doesn't the old RS
> (hadoop1-s05) detect that it is no longer the master and relinquishes
> control of the region?
>
Well, the master doesn't know that s05 has the region open -- that's
why it gives it to s02 -- and then, there is no channel available to
s05 to figure out who has what.

St.Ack
