hbase-user mailing list archives

From Abhijit Pol <a...@rocketfuel.com>
Subject Re: increasing hbase get latencies
Date Fri, 10 Jun 2011 17:10:49 GMT
Thanks for your reply, Stack. Answers are inline. By "timeout %" we mean the
percentage of requests that exceed our acceptable latency threshold.

On Wed, Jun 8, 2011 at 11:31 PM, Stack <stack@duboce.net> wrote:

> What happens if you flush that table/region when its slow (you can do
> it from the shell).  Does the latency go back down?

We performed a table flush at around peak timeouts (35%). Timeouts went up a
bit during the flush and then dropped to the 28% mark (compared to 10%
timeouts after a restart of the server). They started climbing again and
reached the 35% mark within an hour or so.
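For anyone else following along, we ran the flush from the shell like this
(the table and region names below are placeholders, not our actual ones):

```
hbase(main):001:0> flush 'mytable'                      # flush every region of the table
hbase(main):002:0> flush 'mytable,startrow,1307000000'  # or flush one region by its name
```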

> Are there lots of storefiles under these regions? (Check the fs.  Do a
> lsr on region you know is slow).  If lots of storefiles, if you major
> compact the region does the latency go back down?

We major compact once a day, during off-peak night time. During major
compaction our timeouts go even higher (40%), and after it finishes they come
back to the previous high and keep increasing from there.

For our main table (two column families, 300 regions across 10 machines),
each machine accumulates around 1,500 storefiles over the course of the day.
After major compaction they come down to 1,300 or so.
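For reference, this is roughly how we count the files and kick off the
compaction (the `/hbase/mytable` path and table name are illustrative):

```
$ hadoop fs -lsr /hbase/mytable | wc -l     # rough count of files under the table dir
hbase(main):001:0> major_compact 'mytable'  # major compact the whole table from the shell
```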

> If you look in your regionserver logs, what do the stats on your block
> cache look like?  Do the cache hits climb over time?

The cache hit rate starts out at around 80% on restart, then climbs and
stabilizes at around 90% within an hour or two. (Total RAM per RS is 98GB;
60% of that is given to HBase, of which 50% is block cache and 40% memstore.)
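Spelling out the arithmetic behind that split (the two config property names
in the comments are what I believe the corresponding knobs are; the numbers
themselves are just the percentages above applied to 98GB):

```python
# Per-regionserver memory split as described above.
total_ram_gb = 98
heap_gb = total_ram_gb * 0.60    # 60% of RAM given to the HBase heap
block_cache_gb = heap_gb * 0.50  # 50% of heap -> block cache (hfile.block.cache.size)
memstore_gb = heap_gb * 0.40     # 40% of heap -> memstores
                                 # (hbase.regionserver.global.memstore.upperLimit)

print(heap_gb, block_cache_gb, memstore_gb)  # 58.8 29.4 23.52
```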

> The client may have gone away because we took too long to process the
> request.  How many handlers are you running?  Maybe the requests are
> backing up in rpc queues?

We are using a handler count of 500 per region server. We checked the RPC
queue time stats from metrics; it is zero most of the time, with an
occasional single-digit value.

What are the side effects of going to a higher handler count? More memory?
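For reference, ours is set in hbase-site.xml:

```xml
<!-- hbase-site.xml: number of RPC handler threads per regionserver -->
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>500</value>
</property>
```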

> Why 0.90.0 and not 0.90.3 (has some fixes).

Yes, it's on our list. It looks like we should do it sooner rather than later.

>  > since server restart make things look good, is this might be related to
> > minor compaction & block cache?
> >
> Give us some answers to a few of the above questions.  Might help us
> narrow in on whats going on.
A few more data points: the increase in timeout % is highly correlated with
read request load. During the day, when read requests are high, the rate of
increase in timeouts is higher than at night. However, if we restart a server
at any point in time, timeouts go back to 10% and then start increasing
again. All of these issues started when we increased our read volume from a
peak of 30k QPS to a peak of 60k QPS; our write volume has been stable for a
while at a peak of 3k QPS.

Whenever the client does get a response back from HBase (key found or
missing), it is always fast: the max is under 10ms. For the timed-out read
requests, we wait 32ms before giving up (the timeout % ranges from 10% to
35%).
