hbase-user mailing list archives

From Jim Kellerman <...@powerset.com>
Subject RE: Is the latest version of Hbase support multiple updates on same row at the same time?
Date Thu, 17 Apr 2008 15:58:55 GMT
> -----Original Message-----
> From: news [mailto:news@ger.gmane.org] On Behalf Of Zhou
> Sent: Thursday, April 17, 2008 7:44 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: Is the latest version of Hbase support multiple
> updates on same row at the same time?

<snip>

> I've looked at the source code of the BatchUpdate class.
> I believe it collects the update operations for one specified row
> and submits them via one RPC call to the HRegionServer where
> this row is located.
> Am I right?

Correct.

> So in one server, I can actually cache all updates of a
> specified row to one BatchUpdate object. And it might work
> for one process on one server.
> However, how about multiple processes running concurrently on
> different servers?

I'm not sure what you mean by server, but any particular row is only served
by one HBase server. Multiple clients can submit batch updates for the
same row and they will all be handled by a single HBase server.

> Each of them has a BatchUpdate object of its own. I suspect
> it would still cause the "update in progress" exception.

In 0.16 (and also in the hbase-0.1.x releases) the client API
supports only one batch update operation at a time. So if a single
thread makes two startUpdate calls, or if multiple threads each
make a single startUpdate call, you will get the "update in
progress" exception.
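
To make that restriction concrete, here is a toy model in Java. It is
not the real client code: the class ToyLegacyTable, its fields, and the
choice of IllegalStateException are assumptions made for illustration.
Only the startUpdate/commit names and the "update in progress" message
come from this thread.

```java
// Toy model of a client that allows only one update at a time.
// ToyLegacyTable is hypothetical; it is not the HBase 0.16 HTable class.
final class ToyLegacyTable {
    private static final long NO_UPDATE = -1L;
    private long activeLockId = NO_UPDATE; // id of the update in progress, if any
    private long nextLockId = 1L;

    // Begin an update for a row; fails if an earlier update is still uncommitted.
    synchronized long startUpdate(String row) {
        if (activeLockId != NO_UPDATE) {
            // Exception type is an assumption; the message is the one quoted in this thread.
            throw new IllegalStateException("update in progress");
        }
        activeLockId = nextLockId++;
        return activeLockId;
    }

    // Finish the update identified by lockId so that the next one may start.
    synchronized void commit(long lockId) {
        if (lockId != activeLockId) {
            throw new IllegalArgumentException("unknown lock id " + lockId);
        }
        activeLockId = NO_UPDATE;
    }
}

final class LegacyUpdateDemo {
    public static void main(String[] args) {
        ToyLegacyTable table = new ToyLegacyTable();
        long first = table.startUpdate("row1");
        try {
            table.startUpdate("row2"); // second update before the first commit
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        table.commit(first);
        table.startUpdate("row2"); // accepted now that the first update is committed
        System.out.println("second update started after commit");
    }
}
```

In this model the second startUpdate is rejected whether it comes from
the same thread or from another thread sharing the same table object,
which is the situation described above.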

This has changed in HBase trunk. A single thread or multiple
threads can create a separate BatchUpdate object for each row
they want to update. When all the changes have been added to
the BatchUpdate, it is sent to the server by calling
HTable.commit(BatchUpdate).
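
A rough sketch of that pattern, written against stand-in classes so it
runs without HBase on the classpath. RowBatch and ToyTable are
hypothetical names, not the real BatchUpdate and HTable classes, and the
per-key atomic update below only models the idea of the server
serializing commits to the same row.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a per-row batch; not the real BatchUpdate class.
final class RowBatch {
    final String row;
    final Map<String, byte[]> puts = new LinkedHashMap<>();

    RowBatch(String row) {
        this.row = row;
    }

    // Queue a column value locally; nothing is applied until commit.
    void put(String column, String value) {
        puts.put(column, value.getBytes(StandardCharsets.UTF_8));
    }
}

// Hypothetical in-memory table that applies each batch atomically per row.
final class ToyTable {
    private final Map<String, Map<String, byte[]>> rows = new ConcurrentHashMap<>();

    void commit(RowBatch batch) {
        // compute() is atomic per key, so two commits to one row never interleave.
        rows.compute(batch.row, (key, columns) -> {
            Map<String, byte[]> updated = new LinkedHashMap<>();
            if (columns != null) {
                updated.putAll(columns);
            }
            updated.putAll(batch.puts);
            return updated; // published as a fresh snapshot, never mutated afterwards
        });
    }

    String get(String row, String column) {
        Map<String, byte[]> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        byte[] value = columns.get(column);
        return value == null ? null : new String(value, StandardCharsets.UTF_8);
    }
}

final class PerRowBatchDemo {
    public static void main(String[] args) throws InterruptedException {
        ToyTable table = new ToyTable();
        // Each thread builds its own batch for the same row and commits it.
        Thread t1 = new Thread(() -> {
            RowBatch batch = new RowBatch("row1");
            batch.put("info:name", "alice");
            table.commit(batch);
        });
        Thread t2 = new Thread(() -> {
            RowBatch batch = new RowBatch("row1");
            batch.put("info:city", "paris");
            table.commit(batch);
        });
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println(table.get("row1", "info:name") + " " + table.get("row1", "info:city"));
    }
}
```

Because each thread owns its batch object, no client-side coordination
is needed; the only shared point is the commit.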

> Even if I assume it works, with one BatchUpdate object per
> row, if I have millions of rows, I would have to create
> millions of objects.
> I don't think that is workable.

BatchUpdate objects are very inexpensive. The bulk of any
batch update is the column values for put operations.

> And how many batch operations should I cache in the
> BatchUpdate object before commit?

As many as you want to, provided they are for the same row.

> What if the updates require immediate durability
> (D in ACID)?

Not sure I understand the problem. The updates collected in
a BatchUpdate are sent via a single RPC call. The row gets
locked on the server and each update is written to the redo
log before it is cached. When the cache fills it is flushed
to disk. If the server crashes before the cache is flushed,
the data can be recovered from the redo log.
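
The write path described above can be sketched as a small in-memory
model. This is only an illustration: ToyRegionStore and all of its
members are hypothetical, and discarding log entries at flush time is an
assumption of the sketch rather than a statement about how HBase manages
its log.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model: redo log first, then cache, flush to "disk" when the cache fills.
final class ToyRegionStore {
    private final List<String[]> redoLog = new ArrayList<>(); // survives a crash
    private final Map<String, String> disk = new TreeMap<>(); // survives a crash
    private final Map<String, String> cache = new TreeMap<>(); // lost on a crash
    private final int cacheLimit;

    ToyRegionStore(int cacheLimit) {
        if (cacheLimit <= 0) {
            throw new IllegalArgumentException("cacheLimit must be positive");
        }
        this.cacheLimit = cacheLimit;
    }

    void write(String key, String value) {
        redoLog.add(new String[] {key, value}); // 1. record the update in the redo log
        cache.put(key, value);                  // 2. only then put it in the cache
        if (cache.size() >= cacheLimit) {
            flush();                            // 3. a full cache is flushed to disk
        }
    }

    private void flush() {
        disk.putAll(cache);
        cache.clear();
        redoLog.clear(); // assumption: flushed entries no longer need replaying
    }

    // Simulate a server crash: everything held only in memory is gone.
    void crash() {
        cache.clear();
    }

    // Replay the redo log to rebuild the unflushed part of the cache.
    void recover() {
        for (String[] entry : redoLog) {
            cache.put(entry[0], entry[1]);
        }
    }

    String read(String key) {
        String cached = cache.get(key);
        return cached != null ? cached : disk.get(key);
    }
}

final class RedoLogDemo {
    public static void main(String[] args) {
        ToyRegionStore store = new ToyRegionStore(3);
        store.write("row1/info:a", "1");
        store.crash();   // the cached value is lost before any flush
        store.recover(); // and rebuilt from the redo log
        System.out.println(store.read("row1/info:a"));
    }
}
```

The ordering in write() is the point: an update reaches the log before
it is visible in the cache, so a crash between those steps and the next
flush loses nothing that was acknowledged.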

> I believe it is better to solve the concurrent update problem
> on the server side.

And that is exactly what happens in HBase trunk. HBase 0.16 and
hbase-0.1.x do not do that, as you have discovered.

> BatchUpdate would not work, at least for massive amounts of data
> or high load.

Actually it works pretty well. We have several applications with
tens of millions of rows on 10 to 20 servers, currently storing
tens of gigabytes of data.

One user loaded 1.3 billion rows into HBase as a test.

> I hope HBase could fix the problem in the near future.

It is fixed in hbase trunk, which has not yet been released.

> Does any version of HBase allow concurrent updates, so that all
> we need to do is call table.commit(id)?

There is no released version that supports this. It is only
in hbase trunk, which will be released as hbase-0.2.0 in a
few weeks.

By the way, did you know that HBase is now a subproject of
Hadoop and has its own svn repository? All development
of hbase-0.1.x and hbase trunk happens there and not in
the hadoop svn. You can find the hbase source at:

http://svn.apache.org/repos/asf/hadoop/hbase

