hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: Errors after major compaction
Date Thu, 07 Jul 2011 18:44:17 GMT
On Thu, Jul 7, 2011 at 2:56 AM, Eran Kutner <eran@gigya.com> wrote:
>> Well, the master doesn't know that s05 has the region open -- that's
>> why it gives it to s02 -- and then, there is no channel available to
>> s05 to figure out who has what
> The way I see it, that's the root of the problem.

Well, backing up, we have some races in the master to fix first.  There
is also a 'hole' in our transitioning of states up in zk that we
recently found.  These fixes should do a lot to reduce how often the
issue arises.

Thereafter, we've discussed adding a feedback loop where clients can at
least report "something is off", forcing the master to do a
reevaluation (it can ask regionservers what they have and check it
against its in-memory state).
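
A rough sketch of what that reevaluation could look like, in Python
rather than HBase's Java, and not actual master code.  The function
name `reconcile` and both dictionary shapes are assumptions made up for
illustration: it compares what the master believes with what the
regionservers report and flags regions that are open in more than one
place or on an unexpected server.

```python
from collections import defaultdict


def reconcile(master_view, reported):
    """Compare the master's in-memory assignments with regionserver reports.

    master_view: dict of region name -> server the master believes hosts it
    reported:    dict of server name -> set of regions that server says are open
    Both shapes are hypothetical; they are not HBase data structures.
    """
    # Invert the reports: region -> set of servers claiming to have it open.
    holders = defaultdict(set)
    for server, regions in reported.items():
        for region in regions:
            holders[region].add(server)

    problems = {}
    for region in sorted(set(master_view) | set(holders)):
        expected = master_view.get(region)
        actual = holders.get(region, set())
        if len(actual) > 1:
            # More than one server has the region open.
            problems[region] = ("double-assigned", sorted(actual))
        elif actual != ({expected} if expected else set()):
            # Open somewhere the master does not expect, or not open at all.
            problems[region] = ("mismatch", sorted(actual))
    return problems


# The situation from this thread: master gave r1 to s02, s05 still has it open.
master_view = {"r1": "s02", "r2": "s03"}
reported = {"s02": {"r1"}, "s03": {"r2"}, "s05": {"r1"}}
print(reconcile(master_view, reported))
```

What the master does with a flagged region (close it everywhere and
reassign, say) is left out; that is the part the race fixes have to get
right.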

>  It would probably
> make sense if the RS could figure this out independently from the
> master. I don't really see a way to do that other than storing the
> region allocation in a central "reliable" location (read ZK), having
> each RS register itself there when it opens a region and constantly
> monitor the assignment of the regions it holds, looking for other
> RSs that registered the same region. In which case they can either try
> to work out which one should be the owner of the region or they could
> both close the region and let the master select a new RS. This is
> obviously a rough idea that needs more polishing, like how to handle
> old records of dead servers, but that's the only way I can think of
> for guaranteeing there is no double assignment other than using
> broadcasts and election algorithms.
> I can work out the details if people think it's interesting. There's
> also a discussion about it in HBASE-4060.

Please add any ideas to the issue.  At a minimum we'll have to answer
your proposal with why we think the current design works -- once we've
fixed these recently found bugs.
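
For the discussion in the issue, the quoted proposal might be sketched
roughly as below.  This is only an in-memory stand-in, in Python, for
the central store; it does not use ZK or any HBase API, and the class
`RegionRegistry` and its methods are hypothetical names.  It shows the
simpler of the two resolutions mentioned, where every server that finds
a rival closes the region and leaves reassignment to the master.

```python
class RegionRegistry:
    """In-memory stand-in for the central store (ZK in the proposal)."""

    def __init__(self):
        # region name -> set of servers that registered it as open
        self._owners = {}

    def register(self, region, server):
        """Called by a server when it opens a region."""
        self._owners.setdefault(region, set()).add(server)

    def unregister(self, region, server):
        """Called by a server when it closes a region."""
        self._owners.get(region, set()).discard(server)

    def conflicts(self, server):
        """Regions held by `server` that some other server also registered."""
        return {
            region: sorted(owners - {server})
            for region, owners in self._owners.items()
            if server in owners and len(owners) > 1
        }


registry = RegionRegistry()
registry.register("r1", "s05")   # s05 still has r1 open
registry.register("r1", "s02")   # master hands r1 to s02 as well
registry.register("r2", "s02")

# s05 checks its own regions and finds a rival for r1.
rivals = registry.conflicts("s05")

# "Both close" resolution: every server involved gives the region up.
for region, others in rivals.items():
    for server in [*others, "s05"]:
        registry.unregister(region, server)
```

The sketch ignores the hard parts the proposal itself calls out, such
as stale records left behind by dead servers.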

