hbase-user mailing list archives

From elsif <elsif.t...@gmail.com>
Subject Re: HBase Exceptions on version 0.20.1
Date Wed, 21 Oct 2009 15:16:40 GMT

While running the test on this cluster of 14 servers, the highest loads
I see are 3.68 (0.0% wa) on the master node and 2.65 (3.4% wa) on the
node serving the .META. region.  All the machines are on a single
gigabit switch dedicated to the cluster.  The highest throughput between
nodes has been 21.4MBps Rx on the node hosting the .META. region. 

There are 239 "Block blk_-xxx is not valid" errors, 522 "BlockInfo not
found in volumeMap" errors, and 208 "BlockAlreadyExistsException" errors
in the hadoop logs over 12 hours of running the test.
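
For anyone who wants to tally the same messages in their own logs, here is
a minimal sketch. It is not the script used for the numbers above. The
three patterns come from the error strings quoted in this message; the
names count_errors and PATTERNS, and the assumption that each error sits
on a single plain-text log line, are illustrative only.

```python
import re
import sys
from collections import Counter
from typing import Iterable

# Patterns built from the error strings quoted in this message. The exact
# wording of the surrounding log line is an assumption and may differ
# between Hadoop versions.
PATTERNS = {
    "block_not_valid": re.compile(r"Block blk_\S+ is not valid"),
    "blockinfo_not_found": re.compile(r"BlockInfo not found in volumeMap"),
    "block_already_exists": re.compile(r"BlockAlreadyExistsException"),
}


def count_errors(lines: Iterable[str]) -> Counter:
    """Count how many lines match each pattern (one hit per line per pattern)."""
    counts = Counter({name: 0 for name in PATTERNS})
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Usage: python count_errors.py <log file> [<log file> ...]
    total = Counter()
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            total.update(count_errors(handle))
    for name, n in total.items():
        print(f"{name}: {n}")
```

Running it over the datanode and namenode logs of every node and summing
the results gives per-message totals for the whole cluster.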

I understand that I am loading the cluster; that is the point of the
test. But I don't think that load should result in data loss.  Failed
inserts at the client level I can handle, but losing data that was
previously thought to be stored in hbase is a major issue.  Are there
plans to make hbase more resilient to load-based failures?


Andrew Purtell wrote:
> The reason JG points to load as being a problem is that all signs point
> to it: This is usually the culprit behind DFS "no live block" errors --
> the namenode is too busy and/or falling behind, or the datanodes are
> falling behind, or actually failing. Also, in the log snippets you
> provide, HBase is complaining about writes to DFS (for the WAL) taking
> in excess of 2 seconds. That is also highly indicative of load,
> specifically write load. Shortly after this, Zookeeper sessions begin
> expiring, which is also usually indicative of overloading -- heartbeats
> miss their deadline.
> When I see these signs on my test clusters, I/O wait is generally in
> excess of 40%.
> If your total CPU load is really just a few % (user + system + iowait),
> then I'd suggest you look at the storage layer. Is there anything in the
> datanode logs that seems like it might be relevant?
> What about the network? Gigabit? Any potential sources of contention?
> Are you tracking network utilization metrics during the test?
> Also, you might consider using Ganglia to monitor and correlate system
> metrics and HBase and HDFS metrics during your testing, if you are not
> doing this already.
>    - Andy
