hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: increasing hbase get latencies
Date Fri, 10 Jun 2011 19:43:36 GMT
On Fri, Jun 10, 2011 at 10:10 AM, Abhijit Pol <apol@rocketfuel.com> wrote:
> performed a table flush when timeouts were around their 35% peak. Timeouts
> went up a bit during the flush and then dropped to the 28% mark (compared to
> 10% timeouts after a restart of the server). They started climbing again and
> reached the 35% mark in an hour or so.
>
This would seem to indicate that getting from memstore is at least
part of the problem (but not the complete explanation).  HBASE-3855
should help.  FYI, HBASE-3855 won't be in a release until HBase 0.90.4.
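
If you want to repeat the flush experiment without bouncing anything, you
can drive it from the client API.  A rough sketch (the table name 'mytable'
is just a placeholder; the flush call is asynchronous, so give the
regionservers a moment before re-measuring):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class FlushTable {
    public static void main(String[] args) throws Exception {
      // Picks up hbase-site.xml from the classpath.
      Configuration conf = HBaseConfiguration.create();
      HBaseAdmin admin = new HBaseAdmin(conf);
      // Asks every regionserver hosting the table to flush its memstores
      // to new storefiles; the call returns before the flushes finish.
      admin.flush("mytable");
    }
  }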

> we do a major compaction once a day, during off-peak night time. During major
> compaction our timeouts go even higher (40%), and after the major compaction
> they come back to the previous high and keep increasing from there.
>

This would seem to say that the number of storefiles is NOT the issue.
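
If you ever want tighter control over when that daily major compaction
runs, one option is to set hbase.hregion.majorcompaction to 0 in
hbase-site.xml (disabling the time-based trigger) and kick it off yourself
from a cron job in the off-peak window.  A rough sketch, again with
'mytable' as a placeholder:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  public class MajorCompactTable {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HBaseAdmin admin = new HBaseAdmin(conf);
      // Asynchronously requests a major compaction of every region in the
      // table; the regionservers do the rewriting in the background.
      admin.majorCompact("mytable");
    }
  }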


> for our main table, with two column families and 300 regions across 10
> machines, each machine has around 1500 files over the course of the day.
> After major compaction they come down to 1300 or so.
>

30 regions per machine, with two column families per region, would
suggest that after a major compaction you should have only 60 storefiles
or so per machine, since a major compaction rewrites each store (one per
region per column family) down to a single file.  Are there other regions
on these regionservers that make up the bulk of the 1300+ storefiles?
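
To see where those 1500 files actually live, you could walk the table
directory in HDFS.  A rough sketch that assumes the 0.90 layout of
${hbase.rootdir}/<table>/<region>/<family>/<storefile> and that
hbase.rootdir is set in the hbase-site.xml on your classpath ('mytable'
is a placeholder):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class CountStoreFiles {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Path tableDir = new Path(conf.get("hbase.rootdir"), "mytable");
      FileSystem fs = tableDir.getFileSystem(conf);
      int storefiles = 0;
      for (FileStatus region : fs.listStatus(tableDir)) {
        if (!region.isDir()) continue;  // skip table-level files
        for (FileStatus family : fs.listStatus(region.getPath())) {
          // Skip .regioninfo, .tmp compaction dirs, and the like.
          if (!family.isDir() || family.getPath().getName().startsWith(".")) continue;
          storefiles += fs.listStatus(family.getPath()).length;
        }
      }
      System.out.println("storefiles under " + tableDir + ": " + storefiles);
    }
  }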


>> If you look in your regionserver logs, what do the stats on your block
>> cache look like?  Do the cache hits climb over time?
>>
> cache hit rate starts out at around 80% on restart, then climbs and
> stabilizes at around 90% within an hour or two. (total RAM per RS is 98GB,
> 60% given to HBase, 50% of which is block cache and 40% memstore)
>

So, HBase has roughly a 59G heap (60% of 98G)?

Could GC pauses be responsible for the other portion of the slowdown?
It seems like you have GC logging enabled.  Does overall pause time
trend upward?
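
If those 50%/40% figures map to hfile.block.cache.size and
hbase.regionserver.global.memstore.upperLimit, that leaves only about 10%
of the heap for everything else (block index, RPC, compactions, GC
headroom), which is aggressive.  A quick sketch of the arithmetic; the 59G
is just 60% of 98G, and the defaults passed to getFloat are what I believe
0.90 ships with:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class HeapSplit {
    public static void main(String[] args) {
      Configuration conf = HBaseConfiguration.create();
      float blockCache = conf.getFloat("hfile.block.cache.size", 0.2f);
      float memstores = conf.getFloat(
          "hbase.regionserver.global.memstore.upperLimit", 0.4f);
      double heapGB = 59.0;  // roughly 60% of the 98G per regionserver
      System.out.printf(
          "block cache ~%.1fG, memstores up to ~%.1fG, everything else ~%.1fG%n",
          blockCache * heapGB, memstores * heapGB,
          (1 - blockCache - memstores) * heapGB);
    }
  }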

>> The client may have gone away because we took too long to process the
>> request.  How many handlers are you running?  Maybe the requests are
>> backing up in rpc queues?
>>
> we are using a handler count of 500 per regionserver. We checked the RPC
> queue time stats from metrics; it is zero most of the time, and we
> occasionally see a single-digit number for it.
> what are the side effects of going for a higher handler count? more memory?
>

Contention, if too many handlers are in flight at the same time.  RPC
also keeps a queue per handler instance, and backed-up queues hold edits
in memory.  It doesn't seem like that is your issue though; stuff seems
to be moving right along.  Anything else interesting in those rpc
metrics?  Do you see rising latency there?  Can you point to any
particular invocation?
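
For reference, the knob is hbase.regionserver.handler.count in
hbase-site.xml on the regionservers.  A trivial sketch that just reports
what the config on the classpath says (the fallback of 10 is, as far as I
recall, the 0.90 default):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class HandlerCount {
    public static void main(String[] args) {
      Configuration conf = HBaseConfiguration.create();
      // Server-side setting; each handler is a thread with its own call
      // queue, so very high counts mostly cost memory and contention.
      int handlers = conf.getInt("hbase.regionserver.handler.count", 10);
      System.out.println("hbase.regionserver.handler.count = " + handlers);
    }
  }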


>> Why 0.90.0 and not 0.90.3 (it has some fixes)?
>>
> yes, it's on our list. looks like we should do it sooner rather than later.
>

One advantage of the newer releases is that you can use the decommission
script to change configs on a single RS to try things out without
disrupting cluster loading.



> the increase in timeout % is highly correlated with read request load. During
> the day, when read requests are high, the rate of increase in timeouts is
> higher compared to night. However, if we restart a server at any point in
> time, timeouts go back to 10% and start increasing again. All of these issues
> started when we increased our read volume from a peak of 30k qps to a peak of
> 60k qps. Our write volume has been stable for a while at a peak of 3k qps.
>

You might consider patching your Hadoop with HDFS-347.  See the issue.
Lots of upsides.  Downsides are that it's experimental(!) and the
currently posted patch does not checksum.  (We are running this patch on
our frontend; FB runs a version of this patch on at least one of their
clusters.)

> (2)
> whenever the client gets a response back from hbase (with a found or missing
> key) it is always fast: the max is less than 10ms, while the timed-out read
> requests wait the full 32ms (the timeout % ranges from 10% to 35%).
>

Can you dig in more here?
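
One way to dig in: time individual Gets from the client side and log the
slow ones, so you can tell whether the 32ms is being spent in the
regionserver or in the client/RPC path.  A rough sketch; the table
'mytable', family 'cf', and the row keys taken from the command line are
all placeholders:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class TimeGets {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "mytable");
      byte[] family = Bytes.toBytes("cf");
      for (String row : args) {
        long start = System.nanoTime();
        Result r = table.get(new Get(Bytes.toBytes(row)).addFamily(family));
        long micros = (System.nanoTime() - start) / 1000;
        System.out.println(row + (r.isEmpty() ? " miss " : " hit ") + micros + "us");
      }
      table.close();
    }
  }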

St.Ack
