hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bradford Stephens <bradfordsteph...@gmail.com>
Subject Story of my HBase Bugs / Feature Suggestions
Date Thu, 20 Aug 2009 22:48:17 GMT
Hey there,

I'm sending out this summary of how I diagnosed what was wrong with my
cluster in hopes that you can glean some knowledge/suggestions from it :)
Thanks for the diagnostic footwork.

A few days ago,  I noticed that simple MR jobs I was running against .20-RC2
were failing. Scanners were reaching the end of a region, and then simply
freezing. The only indication I had of this was the Mapper timing out after
1000 seconds -- there were no error messages in the logs for either Hadoop
or HBase.

It turns out that my table was corrupt:

1. Doing a 'GET' from the shell on a row near the end of a region resulted
in an error: "Row not in expected region", or something to that effect. It
re-appeared several times, and I never got the row content.
2. What the Master UI indicated for the region distribution was totally
different from what the RS reported. Row key ranges were on different
servers than the UI knew about, and the nodes reported different start and
end keys for a region than the UI.

I'm not sure how this arose: I noticed after a heavy insert job that when we
tried to shut down our cluster, it took 30 dots and more -- so we manually
killed master. Would that lead to corruption?

I finally resolved the problem by dropping the table and re-loading the data

A few suggestions going forward:
1. More useful scanner error messages: GET reported that there was a problem
finding a certain row, why couldn't Scanner? There wasn't even a timeout or
anything -- it just sat there.
2. A fsck / restore would be useful for HBase. I imagine you can recreate
.META. using .regioninfo and scanning blocks out of HDFS. This would play
nice with the HBase bulk loader story, I suppose.

I'll be happy to work on these in my spare time, if I ever get any ;)


http://www.roadtofailure.com -- The Fringes of Scalability, Social Media,
and Computer Science

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message