hbase-user mailing list archives

From "Buckley,Ron" <buckl...@oclc.org>
Subject RE: 0.92 and Read/writes not scaling
Date Wed, 28 Mar 2012 12:41:15 GMT

We've been working on some similar performance testing on our 50 node
cluster running 0.92.1 and CDH3U3.

We were looking mostly at reads, but observed similar behavior. HBase
wasn't particularly busy, but we couldn't make it go faster.

After some debugging, we found that many (sometimes most) of our
responses from HBase would return in 20 or 40 ms.  It was interesting
to watch: we'd ask for the same row over and over, and it would return
in either 0 ms, 20 ms, or 40 ms.

Looking around, we found some related JIRAs:

We added two settings to our config to disable Nagle's algorithm.

For us, setting these two got rid of all of the 20 and 40 ms response
times and cut the average response time we measured from HBase by more
than half.  Plus, we can push HBase a lot harder.
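The specific settings were stripped from this archive copy. As a
sketch, the hbase-site.xml entries commonly used around HBase 0.92 to
disable Nagle's algorithm on the IPC path look like the following; the
property names are an assumption here, so verify them against your
version's hbase-default.xml:

```xml
<!-- Disable Nagle's algorithm (TCP_NODELAY) on the HBase IPC path.
     Property names are as commonly used with HBase 0.92 era clusters;
     confirm against your own hbase-default.xml before relying on them. -->
<property>
  <name>hbase.ipc.client.tcpnodelay</name>
  <value>true</value>
</property>
<property>
  <name>ipc.server.tcpnodelay</name>
  <value>true</value>
</property>
```

Without TCP_NODELAY, a small RPC response can sit in the kernel socket
buffer waiting to be coalesced, which is what produces the
characteristic fixed 20 and 40 ms latencies described above.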



-----Original Message-----
From: Juhani Connolly [mailto:juhani_connolly@cyberagent.co.jp] 
Sent: Wednesday, March 28, 2012 4:27 AM
To: user@hbase.apache.org
Subject: Re: 0.92 and Read/writes not scaling

I think there is a lot of stuff in this thread and the situation has
changed a bit, so I'd like to summarize the current situation and
verify a few things.
Our current environment:
- CDH 4b1: hdfs 0.23 and hbase 0.92
- separate master and namenode, 64gb, 24 cores each, colocated with 
zookeepers (third zookeeper on a separate, unshared server)
- 11 datanode/regionservers, 24 cores, 64gb, 4 * 1.5tb disks (should 
eventually become a bottleneck but isn't yet)
- Table is split into approx 300 regions and is balanced at 25-35 
regions/server, using snappy compression. Unless otherwise mentioned, 
delayed flushing is disabled

The current problem:
- Flushed writes seem slow compared to our previous setup (which was 
the same but using hdfs 0.20.2).
  - Hardware usage is poor, with no visible hardware bottlenecks (this 
was also the case with our old setup)

- We've tested with YCSB, PerformanceEvaluation, an application-specific 
throughput test, and a generic testing solution (attaching a simplified 
version that includes the core issues and works standalone)
- On our hdfs 0.20.2 setup, we were getting throughput of 40,000 
writes/sec (128-256 bytes each), or higher if we delayed log flushes, 
used batch puts, or similar.
- On our new setup, we are getting about 15,000 writes/sec. If we use 
the non-flushing setup (-t writeunflushed in the attached test), 
however, we can easily push 10 times that
- That hardware is not the bottleneck is evidenced by ganglia, top, 
iostat -d, iperf, and a number of other tools.
- We tested append speed with DFSIOTest using 256 byte entries and 10 
files, giving us a throughput of 64 MB/s (about 250,000 entries per 
second in theory), so WAL writes really should be able to keep up with 
a lot of throughput?
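As a quick back-of-the-envelope check of that DFSIOTest figure (a
sketch; it assumes the 64 MB number is a sustained 64 MB/s append
rate):

```python
# Sanity check: how many 256-byte WAL entries per second does a
# 64 MB/s append rate correspond to?
entry_size_bytes = 256                        # worst-case entry size
append_rate_bytes_per_sec = 64 * 1024 * 1024  # assumed 64 MB/s sustained

entries_per_sec = append_rate_bytes_per_sec // entry_size_bytes
print(entries_per_sec)  # -> 262144, i.e. roughly 250,000/sec
```

So the raw append path should have plenty of headroom over the ~15,000
writes/sec actually observed.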

One doubt:
- While we are fairly confident this is not the case, the only thing I 
could think of is that autoFlush was off for our tests with 0.20.2. We 
used the same test program on both versions, and it is only today that 
I explicitly set it to off (so it has been running on the default). We 
never set the writebuffer size.

What I'd like to know:
- What kind of throughput are people getting on data that is fully 
autoFlushed (so every entry is sent to the WAL as table.put() is 
called)? Are our figures (a bit over 1,000 per node) normal? Or should 
we be expecting the figures (4-5,000 per sec per node) that we were 
getting on hdfs 0.20.2?
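Those per-node numbers follow directly from the cluster-wide totals
quoted earlier and the 11 regionservers in this cluster (a quick
sketch):

```python
# Per-node write rates implied by the cluster-wide figures above.
regionservers = 11
hdfs_023_total = 15_000   # writes/sec on the new hdfs 0.23 setup
hdfs_0202_total = 40_000  # writes/sec on the old hdfs 0.20.2 setup

print(round(hdfs_023_total / regionservers))   # -> 1364 ("a bit over 1000")
print(round(hdfs_0202_total / regionservers))  # -> 3636 (near 4-5,000/sec)
```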
- Do people normally see their hardware get anywhere near maxing out on 
heavy write load?
- Is there something wrong with the way we are testing?

On 03/27/2012 12:18 PM, Juhani Connolly wrote:
> Hi Todd,
> Here are our thread dumps from one of our slave nodes while running a
> test. The particular load was set up to grab a table from a tablepool, 
> stop it from autoflushing, put 1000 entries from 128-256 bytes each in 
> (the keys being a random spread throughout the entire keyspace) and 
> then flush manually. The average latency is an atrocious 58 seconds, 
> though of course it is nothing like that if we use single puts or 
> small batches...
> Also put in our configs... They had more in them but we stripped them 
> down a lot to try to get at the problem source, no luck though (we 
> took them down to the bare minimum as well, but that didn't change 
> things, so we restored some of the settings).
> Thanks,
>  Juhani
> On 03/27/2012 10:43 AM, Todd Lipcon wrote:
>> Hi Juhani,
>> I wouldn't have expected CDH4b1 (0.23) to be slower than 0.20 for
>> writes. They should be around the same speed, or even a little faster
>> in some cases. That said, I haven't personally run any benchmarks in
>> several months on this setup. I know our performance/QA team has done
>> some, so I asked them to take a look. Hopefully we should have some
>> results soon.
>> If you can take 10-20 jstacks of the RegionServer and the DN on that
>> same machine while performing your write workload, that would be
>> helpful. It's possible we had a regression during some recent
>> development right before the 4b1 release. If you're feeling
>> adventurous, you can also try upgrading to CDH4b2 snapshot builds,
>> which do have a couple of performance improvements/bugfixes that may
>> help. Drop by #cloudera on IRC and one of us can point you in the
>> right direction if you're willing to try (though of course the
>> builds are somewhat volatile and haven't had any QA)
>> -Todd
>> On Mon, Mar 26, 2012 at 10:08 AM, Juhani Connolly <juhanic@gmail.com>
>> wrote:
>>> On Tue, Mar 27, 2012 at 1:42 AM, Stack <stack@duboce.net> wrote:
>>>> On Mon, Mar 26, 2012 at 6:58 AM, Matt Corgan <mcorgan@hotpads.com>
>>>> wrote:
>>>>> When you increased regions on your previous test, did it start 
>>>>> maxing out CPU?  What improvement did you see?
>>>> What Matt asks: what is your cluster doing?  What changes do you
>>>> see when you, say, increase the size of your batching, and, as Matt
>>>> asks, what was the difference when you went from fewer to more regions?
>>> None of our hardware is even near its limit. Ganglia rarely has a
>>> single machine over 25% load, and we have verified io, network, cpu
>>> and memory all have plenty of breathing space with other tools (top,
>>> iostat, dstat and others mentioned in the hstack article).
>>>>> Have you tried increasing the memstore flush size to something 
>>>>> like 512MB?  Maybe you're blocked on flushes.  40,000 (4,000/server) 
>>>>> is pretty slow for a disabled WAL, I think, especially with a batch 
>>>>> size of 10.  If you increase the write batch size to 1000, how much 
>>>>> does your write throughput increase?
>>>> The above sounds like something to try -- upping flush sizes.
>>>> Are you spending your time compacting all the time?  For kicks try
>>>> disabling compactions when doing your write tests.  Does it make a
>>>> difference?  What does ganglia show as hot?  Are you network-bound,
>>>> io-bound, cpu-bound?
>>>> Thanks,
>>>> St.Ack
>>> The compaction and flush times according to ganglia are pretty short
>>> and insignificant. I've also been watching the rpcs and past events
>>> from the html control panel which don't seem to be indicative of a
>>> problem. However I will try changing the flushes and using bigger
>>> batches, it might turn up something interesting, thanks.
