hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From <Michael.Grund...@high5games.com>
Subject RE: HBase Client Performance Bottleneck in a Single Virtual Machine
Date Mon, 04 Nov 2013 18:32:55 GMT
I've gone through the code in detail - we are using unmanaged connections and they are not
being closed when the table is closed. Thanks!

From: lars hofhansl [mailto:larsh@apache.org]
Sent: Monday, November 04, 2013 12:03 PM
To: Michael Grundvig; user@hbase.apache.org
Subject: Re: HBase Client Performance Bottleneck in a Single Virtual Machine

HConnectionManager.createConnection is a different API creating an "unmanaged" connection.
If you're not using that each HTable.close() might close the underlying connection.

-- Lars

From: "Michael.Grundvig@high5games.com<mailto:Michael.Grundvig@high5games.com>" <Michael.Grundvig@high5games.com<mailto:Michael.Grundvig@high5games.com>>
To: user@hbase.apache.org<mailto:user@hbase.apache.org>; larsh@apache.org<mailto:larsh@apache.org>
Sent: Sunday, November 3, 2013 9:36 PM
Subject: RE: HBase Client Performance Bottleneck in a Single Virtual Machine

Hi Lars, at application startup the pool is created with X number of connections using the
first method you indicated: HConnectionManager.createConnection(conf). We store each connection
in the pool automatically and serve it up to threads as they request it. When a thread is
done using the connection, they return it back to the pool. The connections are not be created
and closed per thread, but only once for the entire application. We are using the GenericObjectPool
from Apache Commons Pooling as the foundation of our connection pooling approach. Our entire
pool implementation really consists of just a couple overridden methods to specify how to
create a new connection and close it. The GenericObjectPool class does all the rest. See here
for details:  http://commons.apache.org/proper/commons-pool/

Each thread is getting a HTableInstance as needed and then closing it when done. The only
thing we are not doing is using the createConnection method that takes in an ExecutorService
as that wouldn't work in our model. Our app is like a web application - the thread pool is
managed outside the scope of our application code so we can't assume the service is available
at connection creation time. Thanks!


-----Original Message-----
From: lars hofhansl [mailto:larsh@apache.org<mailto:larsh@apache.org>]
Sent: Sunday, November 03, 2013 11:27 PM
To: user@hbase.apache.org<mailto:user@hbase.apache.org>
Subject: Re: HBase Client Performance Bottleneck in a Single Virtual Machine

Hi Micheal,

can you try to create a single HConnection in your client:
HConnectionManager.createConnection(Configuration conf) or HConnectionManager.createConnection(Configuration
conf, ExecutorService pool)

Then use HConnection.getTable(...) each time you need to do an operation.

Configuration conf = ...;
ExecutorService pool = ...;
// create a single HConnection for you vm.
HConnection con = HConnectionManager.createConnection(Configuration conf, ExecutorService
pool); // reuse the connection for many tables, even in different threads HTableInterface
table = con.getTable(...); // use table even for only a few operation.
HTableInterface table = con.getTable(...); // use table even for only a few operation.
// at the end close the connection

-- Lars

From: "Michael.Grundvig@high5games.com<mailto:Michael.Grundvig@high5games.com>" <Michael.Grundvig@high5games.com<mailto:Michael.Grundvig@high5games.com>>
To: user@hbase.apache.org<mailto:user@hbase.apache.org>
Sent: Sunday, November 3, 2013 7:46 PM
Subject: HBase Client Performance Bottleneck in a Single Virtual Machine

Hi all; I posted this as a question on StackOverflow as well but realized I should have gone
straight ot the horses-mouth with my question. Sorry for the double post!

We are running a series of HBase tests to see if we can migrate one of our existing datasets
from a RDBMS to HBase. We are running 15 nodes with 5 zookeepers and HBase 0.94.12 for this

We have a single table with three column families and a key that is distributing very well
across the cluster. All of our queries are running a direct look-up; no searching or scanning.
Since the HTablePool is now frowned upon, we are using the Apache commons pool and a simple
connection factory to create a pool of connections and use them in our threads. Each thread
creates an HTableInstance as needed and closes it when done. There are no leaks we can identify.

If we run a single thread and just do lots of random calls sequentially, the performance is
quite good. Everything works great until we start trying to scale the performance. As we add
more threads and try and get more work done in a single VM, we start seeing performance degrade
quickly. The client code is simply attempting to run either one of several gets or a single
put at a given frequency. It then waits until the next time to run, we use this to simulate
the workload from external clients. With a single thread, we will see call times in the 2-3
milliseconds which is acceptable.

As we add more threads, this call time starts increasing quickly. What gets strange is if
we add more VMs, the times hold steady across them all so clearly it's a bottleneck in the
running instance and not the cluster. We can get a huge amount of processing happening across
the cluster very easily - it just has to use a lot of VMs on the client side to do it. We
know the contention isn't in the connection pool as we see the problem even when we have more
connections than threads. Unfortunately, the times are spiraling out of control very quickly.
We need it to support at least 128 threads in practice, but most important I want to support
500 updates/sec and 250 gets/sec. In theory, this should be a piece of cake for the cluster
as we can do FAR more work than that with a few VMs, but we don't even get close to this with
a single VM.

So my question: how do people building high-performance apps with HBase get around this? What
approach are others using for connection pooling in a multi-threaded environment? There seems
to be a surprisingly little amount of info about this on the web considering the popularity.
Is there some client setting we need to use that makes it perform better in a threaded environment?
We are going to try to cache HTable instances next but that's a total guess. There are solutions
to offloading work to other VMs but we really want to avoid this as clearly the cluster can
handle the load and it will dramatically decrease the application performance in critical

Any help is greatly appreciated! Thanks!

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message