hbase-user mailing list archives
From Oded Rosen <o...@legolas-media.com>
Subject Re: DFSClient errors during massive HBase load
Date Mon, 12 Apr 2010 18:31:45 GMT
The tips you guys gave me made a huge difference.
I also used other tips from the "Troubleshooting" section of the hbase wiki,
and from around the web.
I would like to share my current cluster configuration, as only a few places
around the web offer a guided tour of these important configuration changes.
This might be helpful for other people with small clusters who have problems
loading large amounts of data into hbase on a regular basis.
I am not a very experienced user (yet...), so if I got something wrong, or if
I am missing anything, please say so. Thanks in advance.

*1. Prevent your regionserver machines from swapping memory* - this seems to
be a must for small hbase clusters that handle large loads:

*Edit this file (on each regionserver) and then run the following
commands.*

*File:* /etc/sysctl.conf
*Add Values:*
vm.swappiness = 0 (set this on the datanodes/regionservers only!)

*Then run (to apply the changes immediately):*
sysctl -p /etc/sysctl.conf
service network restart

Note: this is a kernel parameter change. A swappiness value of zero means the
kernel will avoid swapping process memory to disk as much as possible (or at
least that is what I understood), so handle with care. A low value (around 5
or 10, out of a maximum of 100) might also work. My configuration is zero.

(Further explanations:
http://cloudepr.blogspot.com/2009/09/cluster-facilities-hardware-and.html )
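
*To verify (optional):* you can check the live kernel value with standard
sysctl usage (the same value is also exposed under /proc):

sysctl vm.swappiness
(this should now print "vm.swappiness = 0"; "cat /proc/sys/vm/swappiness"
shows the bare number)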

*2. Increase the file descriptor limit* - this is also a must-have for almost
any use of hbase.

*Edit these two files (on each datanode/namenode) and then run the
following commands.*

*File:* /etc/security/limits.conf
*Add Values:* (replace "hadoop" with the user that runs your hadoop/hbase
processes, if different)

hadoop soft nofile 32768
hadoop hard nofile 32768

*File:* /etc/sysctl.conf
*Add Values:*

fs.file-max = 32768

*Then run:*
sysctl -p /etc/sysctl.conf
service network restart

Note: you can perform steps 1+2 together, since they both edit sysctl.conf.
Notice that step 1 is only for the regionservers (datanodes), while this one
can also be applied to the master (namenode) - although I'm not so sure that's
necessary. Also note that the limits.conf change only takes effect for new
login sessions of the hadoop user, so restart the hadoop/hbase daemons from a
fresh login so they pick it up.

(see http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A6 )
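
*To verify (optional):* the nofile limit is per-user, so check it from a
fresh login shell of the hadoop user:

su - hadoop -c 'ulimit -n'
(this should now print 32768 instead of the default 1024)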

*3. Raise the HDFS + HBASE connection limit upper bounds:*

Edit the hadoop/hbase configuration files to include these entries, then
restart the daemons so the changes are picked up (you might want to change
the specific values according to your cluster's properties and usage):

*File:* hdfs-site.xml
*Add Properties:*

  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>2047</value>
  </property>
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>10</value>
  </property>

(dfs.datanode.handler.count should be at least the number of nodes in the
cluster, or more if needed.)

(see http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A6 )

*File:* hbase-site.xml
*Add Properties:*

  <property>
    <name>hbase.regionserver.handler.count</name>
    <value>100</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.maxClientCnxns</name>
    <value>100</value>
  </property>

(see
http://hadoop.apache.org/hbase/docs/current/api/overview-summary.html#overview_description
)
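
As a side note - this is just a sketch, and the exact log wording may vary
between hadoop versions - you can check whether the old xcievers ceiling was
actually being hit by grepping the datanode logs (adjust the log path to
your installation):

grep -i "exceeds the limit of concurrent xcievers" \
    /var/log/hadoop/hadoop-*-datanode-*.log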

If you can remember other changes you've made to increase hbase stability,
you are welcome to reply.

Cheers.

On Thu, Apr 1, 2010 at 11:43 PM, Andrew Purtell <apurtell@apache.org> wrote:

> First,
>
>  "ulimit: 1024"
>
> That's fatal. You need to up file descriptors to something like 32K.
>
> See http://wiki.apache.org/hadoop/Hbase/Troubleshooting, item #6
>
> From there, let's see.
>
>    - Andy
>
> > From: Oded Rosen <oded@legolas-media.com>
> > Subject: DFSClient errors during massive HBase load
> > To: hbase-user@hadoop.apache.org
> > Date: Thursday, April 1, 2010, 1:19 PM
> > **Hi all,
> >
> > I have a problem with a massive HBase loading job.
> > It is from raw files to hbase, through some mapreduce
> > processing +
> > manipulating (so loading directly to files will not be
> > easy).
> >
> > After some dozen million successful writes - a few hours of
> > load - some of
> > the regionservers start to die - one by one - until the
> > whole cluster is
> > kaput.
> > The hbase master sees a "znode expired" error each time a
> > regionserver
> > falls. The regionserver errors are attached.
> >
> > Current configurations:
> > Four nodes - one namenode+master, three
> > datanodes+regionservers.
> > dfs.datanode.max.xcievers: 2047
> > ulimit: 1024
> > servers: fedora
> > hadoop-0.20, hbase-0.20, hdfs (private servers, not on ec2
> > or anything).
> >
> >
> > *The specific errors from the regionserver log (from
> > <IP6>, see comment):*
> >
> > 2010-04-01 11:36:00,224 WARN
> > org.apache.hadoop.hdfs.DFSClient:
> > DFSOutputStream ResponseProcessor exception  for
> > block
> > blk_7621973847448611459_244908java.io.IOException: Bad
> > response 1 for block
> > blk_7621973847448611459_244908 from datanode
> > <IP2>:50010
> >     at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2423)
> >
> > *after that, some of this appears:*
> >
> > 2010-04-01 11:36:20,602 INFO
> > org.apache.hadoop.hdfs.DFSClient: Exception in
> > createBlockOutputStream java.io.IOException: Bad connect
> > ack with
> > firstBadLink <IP2>:50010
> > 2010-04-01 11:36:20,602 INFO
> > org.apache.hadoop.hdfs.DFSClient: Abandoning
> > block blk_4280490438976631008_245009
> >
> > *and the FATAL:*
> >
> > 2010-04-01 11:36:32,634 FATAL
> > org.apache.hadoop.hbase.regionserver.HLog:
> > Could not append. Requesting close of hlog
> > java.io.IOException: Bad connect ack with firstBadLink
> > <IP2>:50010
> >     at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2872)
> >     at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2795)
> >     at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
> >     at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
> >
> > *this FATAL error appears many times until this one kicks
> > in:*
> >
> > 2010-04-01 11:38:57,281 FATAL
> > org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
> > Replay of hlog
> > required. Forcing server shutdown
> > org.apache.hadoop.hbase.DroppedSnapshotException: region:
> > .META.,,1
> >     at
> >
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:977)
> >     at
> > org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:846)
> >     at
> >
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:241)
> >     at
> >
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:149)
> > Caused by: java.io.IOException: Bad connect ack with
> > firstBadLink
> > <IP2>:50010
> >     at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2872)
> >     at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2795)
> >     at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
> >     at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
> >
> > *(then the regionserver starts closing itself)*
> >
> > The regionserver on <IP6> was shut down, but problems
> > are correlated with
> > <IP2> (notice the ip in the error msgs). <IP2>
> > was also considered a dead
> > node after these errors, according to the hadoop namenode
> > web ui.
> > I think this is an hdfs failure, rather than
> > hbase/zookeeper (although it is
> > probably because of hbase high load...).
> >
> > On the datanodes, once in a while I had:
> >
> > 2010-04-01 11:24:59,265 ERROR
> > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > DatanodeRegistration(<IP2>:50010,
> > storageID=DS-1822315410-<IP2>-50010-1266860406782,
> > infoPort=50075,
> > ipcPort=50020):DataXceiver
> >
> > but these errors occurred at different times, and not even
> > around crashes. No
> > fatal errors found on the datanode log (but it still
> > crashed).
> >
> > I haven't seen this exact error on the web (only similar
> > ones);
> > This guy (
> http://osdir.com/ml/hbase-user-hadoop-apache/2009-02/msg00186.html)
> > had a similar problem, but not exactly the same.
> >
> > Any ideas?
> > thanks,
> >
> > --
> > Oded
> >


-- 
Oded
