hbase-user mailing list archives

From Oded Rosen <o...@legolas-media.com>
Subject DFSClient errors during massive HBase load
Date Thu, 01 Apr 2010 20:19:11 GMT
Hi all,

I have a problem with a massive HBase loading job.
It goes from raw files to HBase, through some MapReduce processing and
manipulation (so loading directly to files would not be easy).
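
To give a feel for the job, here is a simplified sketch of the kind of
map-only load I mean - not our actual code; the table name, column family
and record layout below are placeholders:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class RawFileLoader {

  // Parses each raw input line and emits one Put per record.
  static class LoadMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Hypothetical record format: rowkey<TAB>value
      String[] fields = line.toString().split("\t", 2);
      if (fields.length < 2) return;  // skip malformed lines
      Put put = new Put(Bytes.toBytes(fields[0]));
      put.add(Bytes.toBytes("data"), Bytes.toBytes("v"),
              Bytes.toBytes(fields[1]));
      context.write(new ImmutableBytesWritable(put.getRow()), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();  // hbase 0.20 idiom
    Job job = new Job(conf, "raw-to-hbase load");
    job.setJarByClass(RawFileLoader.class);
    job.setMapperClass(LoadMapper.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // Wires up TableOutputFormat for the target table; map-only, no reducer.
    TableMapReduceUtil.initTableReducerJob("mytable", null, job);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Nothing exotic - just a steady stream of small Puts through the normal
client write path.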

After a few dozen million successful writes - a few hours of load - some of
the regionservers start to die, one by one, until the whole cluster is
kaput.
The HBase master sees a "znode expired" error each time a regionserver goes
down. The regionserver errors are pasted below.

Current configuration:
Four nodes - one namenode+master, three datanodes+regionservers.
dfs.datanode.max.xcievers: 2047
ulimit -n (open files): 1024
Servers: Fedora
hadoop-0.20, hbase-0.20, HDFS (private servers, not on EC2 or anything).
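
For reference, since these two knobs come up a lot: the xceiver ceiling is
set in hdfs-site.xml on every datanode, and the open-files limit in
/etc/security/limits.conf. The values below are the ones I have seen
commonly recommended for HBase, not what we currently run, and "hadoop" is a
placeholder for whatever user runs the daemons:

in hdfs-site.xml:

  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>

in /etc/security/limits.conf (re-login and restart the daemons afterwards):

  hadoop  -  nofile  32768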


*The specific errors from the regionserver log (from <IP6>; see the note
below about the two IPs):*

2010-04-01 11:36:00,224 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_7621973847448611459_244908
java.io.IOException: Bad response 1 for block blk_7621973847448611459_244908 from datanode <IP2>:50010
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2423)

*after that, some of this appears:*

2010-04-01 11:36:20,602 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink <IP2>:50010
2010-04-01 11:36:20,602 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_4280490438976631008_245009

*and the FATAL:*

2010-04-01 11:36:32,634 FATAL org.apache.hadoop.hbase.regionserver.HLog: Could not append. Requesting close of hlog
java.io.IOException: Bad connect ack with firstBadLink <IP2>:50010
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2872)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2795)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)

*this FATAL error appears many times until this one kicks in:*

2010-04-01 11:38:57,281 FATAL org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog required. Forcing server shutdown
org.apache.hadoop.hbase.DroppedSnapshotException: region: .META.,,1
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:977)
    at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:846)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:241)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:149)
Caused by: java.io.IOException: Bad connect ack with firstBadLink <IP2>:50010
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2872)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2795)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)

*(then the regionserver starts closing itself)*

The regionserver on <IP6> was the one shut down, but the problems are
correlated with <IP2> (note the IP in the error messages). <IP2> was also
marked as a dead node after these errors, according to the hadoop namenode
web UI.
I think this is an HDFS failure rather than an hbase/zookeeper one (although
it is probably triggered by HBase's high load...).

On the datanodes, once in a while I had:

2010-04-01 11:24:59,265 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(<IP2>:50010, storageID=DS-1822315410-<IP2>-50010-1266860406782, infoPort=50075, ipcPort=50020):DataXceiver

but these errors occurred at different times, and not even around the
crashes. No fatal errors were found in the datanode log (yet the node still
went down).
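
If it matters, this is how I plan to verify whether a datanode is hitting
the xceiver or open-files ceiling - grepping for what I believe are the
telltale hadoop-0.20 messages, and checking the live limit of the running
process (the log path is just our layout, and <datanode pid> is a
placeholder):

  $ grep -e "exceeds the limit of concurrent xcievers" \
         -e "Too many open files" /var/log/hadoop/*datanode*.log
  $ grep "open files" /proc/<datanode pid>/limits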

I haven't seen this exact error on the web, only similar ones.
This post (http://osdir.com/ml/hbase-user-hadoop-apache/2009-02/msg00186.html)
describes a similar problem, but not exactly the same one.

Any ideas?
Thanks,

-- 
Oded
