hbase-user mailing list archives

From Ian Varley <ivar...@salesforce.com>
Subject Re: How to Rank in HBase?
Date Mon, 30 Jan 2012 06:36:16 GMT

HBase uses an approach to structuring its storage known as "Log Structured Merge Trees", which
you can learn more about here:


As well as in Lars George's great book, here:


It does all of these "frequent updates" just in memory, which is very fast; at the same time,
it writes a simple forward-only log of all edits (known as the Write Ahead Log, or WAL) to
disk in order to provide durability in the event of machine failure. It periodically writes
the in-memory data to disk in big immutable ordered chunks, called "store files", which is
very efficient. Future reads of the data then "merge" the on-disk store file data with the
current state in memory, to get the full picture of the state of any row. Over time, the many
small store files get "compacted" into bigger files, so that individual reads don't have too
many files to read from. Each "get" or "scan" operation can just read small blocks of the
store files; when you ask for one record, it doesn't have to read gigabytes of data from the
disk, it can just read a small block. As such, random small reads and writes on a very big
data set can be done efficiently.
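
To make the write and read path above concrete, here is a minimal toy sketch in Python. It is not HBase code: the class name ToyLSMStore, the flush_threshold parameter, and the use of plain lists and dicts are all assumptions made for illustration. It only shows the shape of the idea: edits are appended to a log and applied in memory, memory is flushed to immutable sorted files, reads merge memory with the files, and compaction folds many files into one.

```python
import bisect


class ToyLSMStore:
    """Toy model of an LSM-style store (illustrative only, not HBase's implementation)."""

    def __init__(self, flush_threshold=3):
        self.wal = []          # append-only log of edits, in arrival order
        self.memstore = {}     # latest value per row key, held in memory
        self.store_files = []  # immutable key-sorted lists of (key, value), oldest first
        self.flush_threshold = flush_threshold  # hypothetical knob: rows held before a flush

    def put(self, key, value):
        self.wal.append((key, value))  # log the edit first, for durability
        self.memstore[key] = value     # then apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the in-memory data out as one sorted, immutable chunk."""
        if self.memstore:
            self.store_files.append(sorted(self.memstore.items()))
            self.memstore = {}
            self.wal = []  # those edits now live in a store file

    def get(self, key):
        """Merge view: check memory first, then store files from newest to oldest."""
        if key in self.memstore:
            return self.memstore[key]
        for store_file in reversed(self.store_files):
            # Binary search stands in for "read one small block, not the whole file".
            i = bisect.bisect_left(store_file, (key,))
            if i < len(store_file) and store_file[i][0] == key:
                return store_file[i][1]
        return None

    def compact(self):
        """Fold all store files into one, letting newer values overwrite older ones."""
        merged = {}
        for store_file in self.store_files:  # oldest first
            merged.update(store_file)
        self.store_files = [sorted(merged.items())] if merged else []


store = ToyLSMStore(flush_threshold=2)
store.put("row1", "a")
store.put("row2", "b")  # second row triggers a flush: store file 1
store.put("row1", "c")
store.put("row3", "d")  # flush again: store file 2
store.put("row2", "e")  # stays in memory for now

latest_row1 = store.get("row1")  # found in the newer store file
latest_row2 = store.get("row2")  # found in memory, shadowing the older file
files_before = len(store.store_files)
store.compact()
files_after = len(store.store_files)
```

In this sketch a read touches at most the memstore plus one lookup per store file, which is why compaction (fewer files) keeps reads cheap.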

Furthermore, it's fine to update the data store frequently. For any given record, you can
make as many updates as you want to the in-memory structures, and these aren't written to
the store files on disk until the memory store is flushed (they are appended to the WAL
right away, but that's also efficient b/c the WAL is ordered by update time, not record key).
It all happens in memory, which is very fast (but, again, it's safe b/c of the WAL). There
are even some recent JIRAs that make that process more efficient; see, for example,
HBASE-4241 <https://issues.apache.org/jira/browse/HBASE-4241>.
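
As a rough illustration of why frequent updates to the same record are cheap, here is a small Python sketch. It is a deliberate simplification: the function name apply_updates is made up, and it keeps only the latest value per key, whereas real HBase stores versioned cells. The point is just that every update costs one sequential log append plus an in-memory write, while the flush writes each row once, sorted by key.

```python
def apply_updates(updates):
    """Toy model: log every edit in arrival order, keep only the latest value in memory.

    Returns (wal, flushed), where wal is the append-only log and flushed is the
    key-sorted content that a single flush would write out.
    """
    wal = []       # ordered by update time, so writes are pure appends
    memstore = {}  # keyed by row, so repeated updates overwrite in place
    for key, value in updates:
        wal.append((key, value))
        memstore[key] = value
    flushed = sorted(memstore.items())  # one sorted, immutable chunk
    return wal, flushed


# A counter on one row updated 1000 times, plus a single update to another row.
updates = [("page42", n) for n in range(1, 1001)] + [("page7", 5)]
wal, flushed = apply_updates(updates)

wal_entries = len(wal)        # one log append per update
flushed_rows = len(flushed)   # but only one entry per row at flush time
```

Under these assumptions the 1001 updates produce 1001 log appends but only two entries in the flushed chunk.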

One way to think about it is that HBase is *precisely* a layer that adds these efficient random
read/write capabilities on top of the Hadoop distributed file system (HDFS), and takes care
of doing that in a way that parallelizes nicely across a large cluster of machines, deals
with machine failures, etc.


On Jan 29, 2012, at 10:16 PM, Bing Li wrote:

Dear Stack,

Thanks so much for your reply!

According to my understanding, in a large scale distributed system, it
prefers write-once-read-many. Frequent-updating must bring heavy load for
the consistency issue and the performance must be lowered. HBase must not
be suitable to be updated frequently, right?

Best regards,

On Mon, Jan 30, 2012 at 1:51 PM, Stack <stack@duboce.net>

On Sun, Jan 29, 2012 at 12:02 PM, Bing Li <lblabs@gmail.com>
Another question is whether it is proper to update data in HBase

This is 'normal', yes.
