hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Region servers going down under heavy write load
Date Wed, 05 Jun 2013 21:45:32 GMT
bq. I thought this property in hbase-site.xml takes care of that:
zookeeper.session.timeout

From
http://zookeeper.apache.org/doc/current/zookeeperProgrammers.html#ch_zkSessions:

The client sends a requested timeout, the server responds with the timeout
that it can give the client. The current implementation requires that the
timeout be a minimum of 2 times the tickTime (as set in the server
configuration) and a maximum of 20 times the tickTime. The ZooKeeper client
API allows access to the negotiated timeout.
The above means the shared ZooKeeper quorum may return a timeout value
different from the one requested via zookeeper.session.timeout.
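
With the quorum's tickTime=2000 from zoo.cfg, the server caps the negotiated
session at 20 * 2000 ms = 40000 ms, even though 300000 ms is requested through
zookeeper.session.timeout.

As a rough sketch (the connect string and class name below are illustrative,
not taken from your setup), the plain ZooKeeper Java client can show what the
quorum actually grants:

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class NegotiatedTimeoutCheck {
  public static void main(String[] args) throws Exception {
    final CountDownLatch connected = new CountDownLatch(1);
    // Request the same timeout HBase asks for (300000 ms).
    ZooKeeper zk = new ZooKeeper("zkhost1:2181", 300000, new Watcher() {
      @Override
      public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.SyncConnected) {
          connected.countDown();
        }
      }
    });
    connected.await();
    // getSessionTimeout() returns the value negotiated with the server,
    // which is bounded by 2x and 20x the server's tickTime.
    System.out.println("negotiated session timeout: "
        + zk.getSessionTimeout() + " ms");
    zk.close();
  }
}

If the printed value is much lower than 300000, that is the effective session
timeout the region servers have to beat during long GC pauses.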

Cheers

On Wed, Jun 5, 2013 at 2:34 PM, Ameya Kantikar <ameya@groupon.com> wrote:

> In zoo.cfg I have not setup this value explicitly. My zoo.cfg looks like:
>
> tickTime=2000
> initLimit=10
> syncLimit=5
>
> We use a common ZooKeeper cluster for 2 of our HBase clusters. I'll try
> increasing this value in zoo.cfg.
> However, is it possible to set this value per cluster?
> I thought this property in hbase-site.xml takes care of that:
> zookeeper.session.timeout
>
>
> On Wed, Jun 5, 2013 at 1:49 PM, Kevin O'dell <kevin.odell@cloudera.com> wrote:
>
> > Ameya,
> >
> >   What does your zoo.cfg say for your timeout value?
> >
> >
> > On Wed, Jun 5, 2013 at 4:47 PM, Ameya Kantikar <ameya@groupon.com> wrote:
> >
> > > Hi,
> > >
> > > We have heavy MapReduce write jobs running against our cluster. Every once
> > > in a while, we see a region server going down.
> > >
> > > We are on: 0.94.2-cdh4.2.0, r
> > >
> > > We have done some tuning for heavy MapReduce jobs: we have increased
> > > scanner timeouts and lease timeouts, and have also tuned the memstore as follows:
> > >
> > > hbase.hregion.memstore.block.multiplier: 4
> > > hbase.hregion.memstore.flush.size: 134217728
> > > hbase.hstore.blockingStoreFiles: 100
> > >
> > > So now, we are still facing issues. Looking at the logs, it looks like it is
> > > due to a ZooKeeper timeout. We have tuned ZooKeeper settings as follows in
> > > hbase-site.xml:
> > >
> > > zookeeper.session.timeout: 300000
> > > hbase.zookeeper.property.tickTime: 6000
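> > >
> > > That is, hbase-site.xml carries entries along these lines (just a sketch of
> > > the two properties above):
> > >
> > > <property>
> > >   <name>zookeeper.session.timeout</name>
> > >   <value>300000</value>
> > > </property>
> > > <property>
> > >   <name>hbase.zookeeper.property.tickTime</name>
> > >   <value>6000</value>
> > > </property>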
> > >
> > >
> > > The actual log looks like:
> > >
> > >
> > > 2013-06-05 11:46:40,405 WARN org.apache.hadoop.ipc.HBaseServer:
> > > (responseTooSlow):
> > > {"processingtimems":13468,"call":"next(6723331143689528698, 1000), rpc
> > > version=1, client version=29, methodsFingerPrint=54742778",
> > > "client":"10.20.73.65:41721","starttimems":1370432786933,"queuetimems":1,
> > > "class":"HRegionServer","responsesize":39611416,"method":"next"}
> > >
> > > 2013-06-05 11:46:54,988 INFO org.apache.hadoop.io.compress.CodecPool: Got
> > > brand-new decompressor [.snappy]
> > >
> > > 2013-06-05 11:48:03,017 WARN org.apache.hadoop.hdfs.DFSClient:
> > > DFSOutputStream ResponseProcessor exception  for block
> > > BP-53741567-10.20.73.56-1351630463427:blk_9026156240355850298_8775246
> > > java.io.EOFException: Premature EOF: no length prefix available
> > >         at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
> > >         at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:95)
> > >         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:656)
> > >
> > > 2013-06-05 11:48:03,020 WARN org.apache.hadoop.hbase.util.Sleeper: *We
> > > slept 48686ms instead of 3000ms*, this is likely due to a long garbage
> > > collecting pause and it's usually bad, see
> > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > >
> > > 2013-06-05 11:48:03,094 FATAL
> > > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> > > smartdeals-hbase14-snc1.snc1,60020,1370373396890: Unhandled exception:
> > > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> > > currently processing smartdeals-hbase14-snc1.snc1,60020,1370373396890 as
> > > dead server
> > >
> > > (Not sure why it says 3000ms when we have timeout at 300000ms)
> > >
> > > We have done some GC tuning as well. Wondering what I can tune to keep the
> > > RS from going down? Any ideas?
> > > This is a batch-heavy cluster, and we care less about read latency. We can
> > > increase RAM a bit more, but not much (the RS already has 20GB of memory).
> > >
> > > Thanks in advance.
> > >
> > > Ameya
> > >
> >
> >
> >
> > --
> > Kevin O'Dell
> > Systems Engineer, Cloudera
> >
>
