hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: HBase Exceptions on version 0.20.1
Date Wed, 21 Oct 2009 17:05:09 GMT
On Wed, Oct 21, 2009 at 8:16 AM, elsif <elsif.then@gmail.com> wrote:

> There are 239 "Block blk_-xxx is not valid" errors, 522 "BlockInfo not
> found in volumeMap" errors, and 208 "BlockAlreadyExistsException" found
> in the hadoop logs over 12 hours of running the test.

Are the above from application-level (HBase) logs or from datanode logs? If
you trace any of them in the NN logs (follow the block name), are the blocks
lost, or do you see replicas taking over or recoveries being triggered?
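One way to do that tracing by hand is to pull every line that mentions a given block out of the namenode and datanode logs and read them in order. Below is a minimal Java sketch of that, assuming the logs are plain-text files and that block names appear as `blk_<id>`, possibly followed by `_<generation stamp>`. The class name `BlockTrace` and its command-line shape are hypothetical, not part of Hadoop.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class BlockTrace {
    /**
     * Returns the lines mentioning blockId, in their original order.
     * The negative lookahead stops "blk_-42" from matching "blk_-421", while
     * still matching "blk_-42_1001" (ID followed by a generation stamp).
     * The log line format is an assumption.
     */
    public static List<String> linesForBlock(List<String> lines, String blockId) {
        Pattern p = Pattern.compile(Pattern.quote(blockId) + "(?!\\d)");
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (p.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    // Hypothetical usage: java BlockTrace blk_-123 namenode.log datanode1.log ...
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: BlockTrace <blockId> <logfile>...");
            System.exit(1);
        }
        for (int i = 1; i < args.length; i++) {
            Path file = Paths.get(args[i]);
            for (String hit : linesForBlock(Files.readAllLines(file), args[0])) {
                System.out.println(file + ": " + hit);
            }
        }
    }
}
```

Reading the namenode hits in timestamp order should show whether the block was re-replicated from a surviving replica or simply dropped.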

> I understand that I am loading the cluster - that is the point of the
> test, but I don't think that this should result in data loss.  Failed
> inserts at the client level I can handle, but loss of data that was
> previously thought to be stored in hbase is a major issue.  Are there
> plans to make hbase more resilient to load based failures?

Going by a few of the exceptions you originally provided, it does look like
there will be data loss. Here are a couple of comments:

"No live nodes contain current block"  We usually see this when the
client-side Hadoop has not been patched with hdfs-127/hadoop-4681.  Your
test program doesn't seem to have come across.  Would you mind attaching it
to an issue so I can try it?  Going by the way you started your test
program, you should have the HBase-patched Hadoop first in your CLASSPATH,
so you should be OK.  Maybe something in your environment is preventing
HBase from using the patched Hadoop?
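To confirm which Hadoop jar actually wins, a small check like the following could be run with the same CLASSPATH as the test program. It is a sketch, not an HBase or Hadoop tool: `ClasspathCheck` is a hypothetical name, the "jar file name contains hadoop" heuristic is an assumption, and the class name `org.apache.hadoop.hdfs.DFSClient` assumes the 0.20 package layout.

```java
import java.io.File;
import java.net.URL;
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.List;

public class ClasspathCheck {
    /** Classpath entries, in search order, whose file name looks like a Hadoop jar. */
    public static List<String> hadoopJars(String classpath) {
        List<String> jars = new ArrayList<>();
        for (String entry : classpath.split(File.pathSeparator)) {
            String name = new File(entry).getName().toLowerCase();
            if (name.contains("hadoop") && name.endsWith(".jar")) {
                jars.add(entry);
            }
        }
        return jars;
    }

    /** Where the JVM actually loaded the class from, or null if not found/unknown. */
    public static URL loadedFrom(String className) {
        try {
            // initialize=false: locate the class without running its static initializers
            Class<?> c = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            return src == null ? null : src.getLocation();
        } catch (ClassNotFoundException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println("Hadoop jars in classpath order: "
                + hadoopJars(System.getProperty("java.class.path")));
        // Assumed 0.20-era location of the HDFS client class.
        System.out.println("DFSClient loaded from: "
                + loadedFrom("org.apache.hadoop.hdfs.DFSClient"));
    }
}
```

If the DFSClient location points at an unpatched Hadoop jar rather than the one shipped with HBase, that would be consistent with the error above.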

"java.io.IOException: TIMED OUT" Your regionserver or master timed out its
ZooKeeper session.  Is it GC pauses or swapping, or is the disk ZooKeeper
uses under heavy I/O load?
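If GC or swapping is suspected, one common technique is a pause detector: a thread that repeatedly sleeps for a fixed interval and reports when far more wall-clock time has passed than it asked for. A minimal Java sketch follows; the 500 ms interval and 2 s threshold are illustrative assumptions, and `PauseDetector` is a hypothetical class, not something shipped with HBase or ZooKeeper.

```java
import java.util.concurrent.TimeUnit;

public class PauseDetector implements Runnable {
    private final long intervalMs;
    private final long warnThresholdMs;
    private volatile boolean running = true;

    public PauseDetector(long intervalMs, long warnThresholdMs) {
        this.intervalMs = intervalMs;
        this.warnThresholdMs = warnThresholdMs;
    }

    /** Delay beyond the requested sleep, given two System.nanoTime() readings. */
    public static long extraDelayMs(long startNanos, long endNanos, long intervalMs) {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0, elapsedMs - intervalMs);
    }

    public void stop() {
        running = false;
    }

    @Override
    public void run() {
        while (running) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            long extra = extraDelayMs(start, System.nanoTime(), intervalMs);
            if (extra > warnThresholdMs) {
                System.err.println("Detected a pause of ~" + extra
                        + " ms (GC, swapping, or CPU starvation?)");
            }
        }
    }

    public static void main(String[] args) {
        // Non-daemon thread: keeps watching until the process is killed.
        new Thread(new PauseDetector(500, 2000), "pause-detector").start();
    }
}
```

Run standalone, it only sees stalls of its own JVM plus host-wide ones that may also stall it, such as heavy swapping; catching regionserver GC pauses would require running it inside that JVM. Pauses approaching the ZooKeeper session timeout are the ones that matter.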

"ClosedChannelException" Probably a symptom of a RS shutdown caused by
events such as those above.

"Abandoning block..." Did this write to the HLog fail? It's just an
INFO-level log message from DFSClient.

"file system not available" What happened before this?  Was this just
emitted during regionserver shutdown?

