hbase-user mailing list archives

From Oded Rosen <o...@legolas-media.com>
Subject Re: Recommendations (WAS -> Re: DFSClient errors during massive HBase load)
Date Sat, 17 Apr 2010 16:26:39 GMT
I thought we'd reached stability, but we haven't :(

Brief reminder: I have a small cluster, 3 regionservers (+datanodes) and 1
master (+namenode).
We perform a massive load of data into HBase every few minutes.
I had problems with regionservers shutting down, and I fixed the
following issues (also covered in previous posts in this thread):
- ulimit to 32k for user hadoop
- swappiness to zero (swap is almost never used on my regionservers)
- max.xcievers to 2047

The problems vanished for several weeks, but now they are back.
Logs from a crashed regionserver are attached
(a few entries before and after the fatal error; there were no other
exceptions in that log).

Our short experience with HBase has been filled with these crashes.
Any directions?
Thanks,


2010-04-17 10:08:01,267 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
createBlockOutputStream java.io.EOFException
2010-04-17 10:08:01,267 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
block blk_-728993511454589630_565607
2010-04-17 10:08:06,302 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Blocking updates for 'IPC Server handler 5 on 60020' on region
users,,1271516390724: memstore size 128.0m is >= than blocking 128.0m size
2010-04-17 10:08:07,270 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
Exception: java.io.IOException: Unable to create new block.
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2814)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)

2010-04-17 10:08:07,270 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_-728993511454589630_565607 bad datanode[0] nodes ==
null
2010-04-17 10:08:07,270 WARN org.apache.hadoop.hdfs.DFSClient: Could not get
block locations. Source file
"/hbase/users/2065286203/events/1500475903353007638" - Aborting...
*2010-04-17 10:08:07,273 FATAL
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog
required. Forcing server shutdown*
org.apache.hadoop.hbase.DroppedSnapshotException: region:
users,,1271516390724
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:977)
    at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:846)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:241)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:149)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readByte(DataInputStream.java:250)
    at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
    at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
    at org.apache.hadoop.io.Text.readString(Text.java:400)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2870)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2795)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
2010-04-17 10:08:07,274 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
request=0.0, regions=2, stores=5, storefiles=0, storefileIndexSize=0,
memstoreSize=128, compactionQueueSize=0, usedHeap=168, maxHeap=987,
blockCacheSize=1700064, blockCacheFree=205393696, blockCacheCount=0,
blockCacheHitRatio=0, fsReadLatency=0, fsWriteLatency=0, fsSyncLatency=0
2010-04-17 10:08:07,274 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
regionserver/10.21.204.132:60020.cacheFlusher exiting
2010-04-17 10:08:07,299 INFO org.apache.hadoop.hbase.regionserver.HLog: Roll
/hbase/.logs/udbs1,60020,1271516277019/hlog.dat.1271516861117,
entries=183804, calcsize=85969764, filesize=66248416. New hlog
/hbase/.logs/udbs1,60020,1271516277019/hlog.dat.1271516887280
2010-04-17 10:08:07,299 INFO org.apache.hadoop.hbase.regionserver.LogRoller:
LogRoller exiting.
2010-04-17 10:08:07,356 INFO
org.apache.hadoop.hbase.regionserver.LogFlusher:
regionserver/10.21.204.132:60020.logFlusher exiting
2010-04-17 10:08:07,527 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
server on 60020
2010-04-17 10:08:07,528 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Flush requested on users,,1271516390724
2010-04-17 10:08:07,528 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 18 on 60020: exiting
2010-04-17 10:08:07,528 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
2010-04-17 10:08:07,528 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC
Server listener on 60020
2010-04-17 10:08:07,528 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 0 on 60020: exiting
2010-04-17 10:08:07,528 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 23 on 60020: exiting
2010-04-17 10:08:07,528 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 2 on 60020: exiting
2010-04-17 10:08:07,528 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 19 on 60020: exiting
2010-04-17 10:08:07,528 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 6 on 60020: exiting
2010-04-17 10:08:07,528 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC
Server Responder
2010-04-17 10:08:07,528 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 11 on 60020: exiting
2010-04-17 10:08:07,528 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 22 on 60020: exiting
2010-04-17 10:08:07,529 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 16 on 60020: exiting
2010-04-17 10:08:07,528 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 1 on 60020: exiting
2010-04-17 10:08:07,529 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 20 on 60020: exiting
2010-04-17 10:08:07,528 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 8 on 60020: exiting
2010-04-17 10:08:07,529 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 21 on 60020: exiting
2010-04-17 10:08:07,528 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 4 on 60020: exiting
2010-04-17 10:08:07,528 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 12 on 60020: exiting
2010-04-17 10:08:07,529 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 17 on 60020: exiting
2010-04-17 10:08:07,529 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 3 on 60020: exiting
2010-04-17 10:08:07,529 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 13 on 60020: exiting
2010-04-17 10:08:07,529 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 15 on 60020: exiting
2010-04-17 10:08:07,529 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 24 on 60020: exiting
2010-04-17 10:08:07,633 DEBUG org.apache.hadoop.hbase.regionserver.HLog:
closing hlog writer in
hdfs://udbm1:8020/hbase/.logs/udbs1,60020,1271516277019
2010-04-17 10:08:07,633 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread:
regionserver/10.21.204.132:60020.compactor exiting
2010-04-17 10:08:07,634 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker:
regionserver/10.21.204.132:60020.majorCompactionChecker exiting
2010-04-17 10:08:07,653 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: On abort, closed hlog
2010-04-17 10:08:07,653 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: closing region
log,,1271516431393
2010-04-17 10:08:07,653 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Closing log,,1271516431393: compactions & flushes disabled
2010-04-17 10:08:07,653 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Updates disabled for region, no outstanding scanners on log,,1271516431393
2010-04-17 10:08:07,653 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
No more row locks outstanding on region log,,1271516431393
2010-04-17 10:08:07,653 DEBUG org.apache.hadoop.hbase.regionserver.Store:
closed data
2010-04-17 10:08:07,653 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Closed log,,1271516431393
2010-04-17 10:08:07,653 DEBUG
org.apache.hadoop.hbase.regionserver.HRegionServer: closing region
users,,1271516390724
2010-04-17 10:08:07,653 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Closing users,,1271516390724: compactions & flushes disabled
2010-04-17 10:08:07,653 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Updates disabled for region, no outstanding scanners on users,,1271516390724
2010-04-17 10:08:07,653 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
No more row locks outstanding on region users,,1271516390724
2010-04-17 10:08:07,653 DEBUG org.apache.hadoop.hbase.regionserver.Store:
closed family1
2010-04-17 10:08:07,654 DEBUG org.apache.hadoop.hbase.regionserver.Store:
closed family2
2010-04-17 10:08:07,654 DEBUG org.apache.hadoop.hbase.regionserver.Store:
closed family3
2010-04-17 10:08:07,654 DEBUG org.apache.hadoop.hbase.regionserver.Store:
closed family4
2010-04-17 10:08:07,654 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Closed users,,1271516390724
2010-04-17 10:08:07,654 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at:
10.21.204.132:60020



On Mon, Apr 12, 2010 at 9:44 PM, Stack <stack@duboce.net> wrote:

> Thanks for writing back to the list, Oded.  I changed the subject so
> it's easier to find your suggestions amongst the mailing list weeds
> going forward.  On swappiness, setting it to 0 is extreme, but since
> you've supplied links, users can do as you suggest or do something not
> so radical.
>
> Good stuff,
>
> St.Ack
>
>
> On Mon, Apr 12, 2010 at 11:31 AM, Oded Rosen <oded@legolas-media.com>
> wrote:
> > The tips you guys gave me made a huge difference.
> > I also used other tips from the "Troubleshooting" section in the HBase wiki,
> > and from all around the web.
> > I would like to share my current cluster configuration, as only a few places
> > around the web offer a guided tour of these important configuration changes.
> > This might be helpful for other people with small clusters that have
> > problems loading large amounts of data into HBase on a regular basis.
> > I am not a very experienced user (yet...), so if I got something wrong, or if
> > I am missing anything, please say so. Thanks in advance.
> >
> > *1. Prevent your regionserver machines from swapping memory* - this is a must
> > have, it seems, for small HBase clusters that handle large loads:
> >
> > *Edit this file (on each regionserver) and then activate the following
> > commands.*
> >
> > *File:* /etc/sysctl.conf
> > *Add Values:*
> > vm.swappiness = 0   (this value - on the datanodes/regionservers only!)
> >
> > *Then run (In order to apply the changes immediately):*
> > sysctl -p /etc/sysctl.conf
> > service network restart
> >
> > note: this is a kernel property change. A swappiness of zero means
> > the machine will avoid using swap almost entirely (or at least that's what I
> > understood), so handle with care. A low value (around 5 or 10, out of the
> > maximum value of 100) might also work. My configuration is zero.
> >
> > (Further explanations:
> > http://cloudepr.blogspot.com/2009/09/cluster-facilities-hardware-and.html)
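> >
> > (A quick way to double-check that the change took effect on a node - these are
> > just standard sysctl/proc commands, nothing hbase-specific:)
> >
> > # should print "vm.swappiness = 0" after running sysctl -p
> > sysctl vm.swappiness
> > # or read the value straight from /proc
> > cat /proc/sys/vm/swappiness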
> >
> > *2. Increase file descriptor limit* - this is also a must have for almost
> > any use of hbase.
> >
> > *Edit these two files (on each datanode/namenode) and then activate the
> > following commands.*
> >
> > *File:* /etc/security/limits.conf
> > *Add Values:*
> >
> > hadoop soft nofile 32768
> > hadoop hard nofile 32768
> >
> > *File:* /etc/sysctl.conf
> > *Add Values:*
> >
> > fs.file-max = 32768
> >
> > *Then run:*
> > sysctl -p /etc/sysctl.conf
> > service network restart
> > note: you can perform steps 1 and 2 together, since they both edit sysctl.conf.
> > Notice that step 1 is only for the regionservers (datanodes),
> > while this one can also be applied to the master (namenode) - although I'm not
> > so sure that's necessary.
> >
> > (see http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A6)
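> >
> > (To confirm the new limit is actually picked up by the hadoop user, something
> > like the following should work - a generic check, not hbase-specific; the
> > hadoop/hbase daemon logs may also print the ulimit near startup:)
> >
> > # should print 32768 once limits.conf is in place and you log in again
> > su - hadoop -c 'ulimit -n'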
> >
> > *3. Raise the HDFS + HBase connection limit upper bounds:*
> >
> > Edit hadoop/hbase configuration files to include these entries:
> > (you might want to change the specific values, according to your cluster
> > properties and usage).
> >
> > *File:* hdfs-site.xml
> > *Add Properties:*
> >
> > name: dfs.datanode.max.xcievers
> > value: 2047
> >
> > name: dfs.datanode.handler.count
> > value: 10 (at least the number of nodes in the cluster, or more if needed).
> >
> > (see http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A6)
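> >
> > (For reference, in hdfs-site.xml those two entries would look roughly like
> > this - same property names as above, and the values are simply the ones I use:)
> >
> >  <property>
> >    <name>dfs.datanode.max.xcievers</name>
> >    <value>2047</value>
> >  </property>
> >  <property>
> >    <name>dfs.datanode.handler.count</name>
> >    <value>10</value>
> >  </property>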
> >
> > *File:* hbase-site.xml
> > *Add Properties:*
> >
> >  <property>
> >    <name>hbase.regionserver.handler.count</name>
> >    <value>100</value>
> >  </property>
> >  <property>
> >    <name>hbase.zookeeper.property.maxClientCnxns</name>
> >    <value>100</value>
> >  </property>
> >
> > (see
> > http://hadoop.apache.org/hbase/docs/current/api/overview-summary.html#overview_description)
> >
> > If you can remember other changes you've made to increase HBase stability,
> > you are welcome to reply.
> >
> > Cheers.
> >
> > On Thu, Apr 1, 2010 at 11:43 PM, Andrew Purtell <apurtell@apache.org>
> wrote:
> >
> >> First,
> >>
> >>  "ulimit: 1024"
> >>
> >> That's fatal. You need to up file descriptors to something like 32K.
> >>
> >> See http://wiki.apache.org/hadoop/Hbase/Troubleshooting, item #6
> >>
> >> From there, let's see.
> >>
> >>    - Andy
> >>
> >> > From: Oded Rosen <oded@legolas-media.com>
> >> > Subject: DFSClient errors during massive HBase load
> >> > To: hbase-user@hadoop.apache.org
> >> > Date: Thursday, April 1, 2010, 1:19 PM
> >> > Hi all,
> >> >
> >> > I have a problem with a massive HBase loading job.
> >> > It goes from raw files to HBase, through some mapreduce
> >> > processing and manipulation (so loading directly to files
> >> > will not be easy).
> >> >
> >> > After some dozen million successful writes - a few hours of
> >> > load - some of
> >> > the regionservers start to die - one by one - until the
> >> > whole cluster is
> >> > kaput.
> >> > The hbase master sees a "znode expired" error each time a
> >> > regionserver
> >> > falls. The regionserver errors are attached.
> >> >
> >> > Current configurations:
> >> > Four nodes - one namenode+master, three
> >> > datanodes+regionservers.
> >> > dfs.datanode.max.xcievers: 2047
> >> > ulimit: 1024
> >> > servers: fedora
> >> > hadoop-0.20, hbase-0.20, hdfs (private servers, not on ec2
> >> > or anything).
> >> >
> >> >
> >> > *The specific errors from the regionserver log (from
> >> > <IP6>, see comment):*
> >> >
> >> > 2010-04-01 11:36:00,224 WARN
> >> > org.apache.hadoop.hdfs.DFSClient:
> >> > DFSOutputStream ResponseProcessor exception  for
> >> > block
> >> > blk_7621973847448611459_244908java.io.IOException: Bad
> >> > response 1 for block
> >> > blk_7621973847448611459_244908 from datanode
> >> > <IP2>:50010
> >> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2423)
> >> >
> >> > *after that, some of this appears:*
> >> >
> >> > 2010-04-01 11:36:20,602 INFO
> >> > org.apache.hadoop.hdfs.DFSClient: Exception in
> >> > createBlockOutputStream java.io.IOException: Bad connect
> >> > ack with
> >> > firstBadLink <IP2>:50010
> >> > 2010-04-01 11:36:20,602 INFO
> >> > org.apache.hadoop.hdfs.DFSClient: Abandoning
> >> > block blk_4280490438976631008_245009
> >> >
> >> > *and the FATAL:*
> >> >
> >> > 2010-04-01 11:36:32,634 FATAL
> >> > org.apache.hadoop.hbase.regionserver.HLog:
> >> > Could not append. Requesting close of hlog
> >> > java.io.IOException: Bad connect ack with firstBadLink
> >> > <IP2>:50010
> >> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2872)
> >> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2795)
> >> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
> >> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
> >> >
> >> > *this FATAL error appears many times until this one kicks
> >> > in:*
> >> >
> >> > 2010-04-01 11:38:57,281 FATAL
> >> > org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
> >> > Replay of hlog
> >> > required. Forcing server shutdown
> >> > org.apache.hadoop.hbase.DroppedSnapshotException: region:
> >> > .META.,,1
> >> >     at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:977)
> >> >     at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:846)
> >> >     at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:241)
> >> >     at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:149)
> >> > Caused by: java.io.IOException: Bad connect ack with firstBadLink <IP2>:50010
> >> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2872)
> >> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2795)
> >> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
> >> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
> >> >
> >> > *(then the regionserver starts closing itself)*
> >> >
> >> > The regionserver on <IP6> was shut down, but the problems
> >> > are correlated with <IP2> (notice the IP in the error messages).
> >> > <IP2> was also considered a dead node after these errors,
> >> > according to the hadoop namenode web UI.
> >> > I think this is an HDFS failure rather than
> >> > HBase/ZooKeeper (although it is
> >> > probably caused by the high HBase load...).
> >> >
> >> > On the datanodes, once in a while I had:
> >> >
> >> > 2010-04-01 11:24:59,265 ERROR
> >> > org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> > DatanodeRegistration(<IP2>:50010,
> >> > storageID=DS-1822315410-<IP2>-50010-1266860406782,
> >> > infoPort=50075,
> >> > ipcPort=50020):DataXceiver
> >> >
> >> > but these errors occurred at different times, and not even
> >> > around crashes. No fatal errors were found in the datanode
> >> > log (but it still crashed).
> >> >
> >> > I haven't seen this exact error on the web (only similar ones);
> >> > this guy (http://osdir.com/ml/hbase-user-hadoop-apache/2009-02/msg00186.html)
> >> > had a similar problem, but not exactly the same.
> >> >
> >> > Any ideas?
> >> > thanks,
> >> >
> >> > --
> >> > Oded
> >> >
> >>
> >>
> >>
> >>
> >>
> >
> >
> > --
> > Oded
> >
>



-- 
Oded
