hbase-user mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: HBase region size
Date Fri, 01 Jul 2011 08:23:41 GMT
> From: Stack <stack@duboce.net>

>>  3. The size of them varies like this
>>            70% from them have their length < 1MB
>>            29% from them have their length between 1MB and 10 MB
>>            1% from them have their length > 10MB (they can have also 100MB)
> What David says above though Jack in his yfrog presentation today
> talks of storing all images in hbase up to 5MB in size.
> Karthick in his presentation at hadoop summit talked about how once
> cells cross a certain size -- he didn't say what the threshold was I
> believe -- then only the metadata is stored in hbase and the content
> goes to their "big stuff" system.
> Try it I'd say.  If only a few instances of 100MB, HBase might be fine.

I've seen problematic behavior in the past when storing values larger than 100 MB and then
running concurrent scans over tables containing many such objects. The default KeyValue size
limit is 10 MB, which is usually sufficient. For webtable-like applications I may raise it
to 50 MB; objects larger than that are not interesting anyway (to me).

One reasonable way to handle native storage of large objects in HBase would be to introduce
a layer of indirection. Break the large object up into chunks. Store the chunks in a manner
that gets good distribution in the keyspace, maybe by SHA-1 hash of the content. Then store
an index to the chunks with the key of your choice. Get the key to retrieve the index, then
use a MultiAction to retrieve the referenced chunks in parallel. Given large objects you are
going to need a number of round trips over the network to pull all of the data anyway. Adding
a couple more in the front may not cause the result to fall outside the performance bound
of your application.
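
The indirection described above can be sketched roughly as follows. This is only an illustration, not HBase client code: the two dicts are hypothetical stand-ins for HBase tables, the names put_large_object and get_large_object are made up for the example, the thread pool merely imitates a parallel multi-get, and the chunk size is deliberately tiny.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4  # bytes; tiny for illustration, a real value would be in the MB range

# Hypothetical stand-ins for two HBase tables (or column families).
chunk_table = {}  # SHA-1 hex digest of chunk content -> chunk bytes
index_table = {}  # application key -> ordered list of chunk digests


def put_large_object(key, data, chunk_size=CHUNK_SIZE):
    """Split data into chunks, store each under its SHA-1, then store the index."""
    digests = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha1(chunk).hexdigest()
        chunk_table[digest] = chunk  # identical chunks end up under the same key
        digests.append(digest)
    index_table[key] = digests


def get_large_object(key, max_workers=4):
    """Fetch the index first, then fetch the referenced chunks concurrently."""
    digests = index_table[key]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so chunks come back in index order.
        chunks = list(pool.map(chunk_table.__getitem__, digests))
    return b"".join(chunks)
```

Because the chunk keys are content hashes, they spread across the keyspace regardless of the application key, and the index row stays small no matter how large the object is.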

However, you will put your client under heap pressure that way, as objects in HBase are fully
transferred at once to the client in the RPC response. Another option is to store large objects
directly in HDFS and keep only the path to each one in HBase. A benefit of this approach is that
you can stream the data out of HDFS with as little or as much buffering in your application as
you would like.
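
A minimal sketch of the store-by-reference idea, under loud assumptions: a local directory stands in for HDFS, a dict stands in for the HBase table holding the paths, and the function names and buffer size are hypothetical. The point is only that the reader chooses how much to buffer.

```python
import os
import tempfile

BUFFER_SIZE = 64 * 1024  # bytes read per iteration; an arbitrary choice

# Hypothetical stand-in for an HBase table: row key -> path of the stored object.
path_table = {}


def put_by_reference(key, data, storage_dir):
    """Write the object to the file store and record only its path."""
    path = os.path.join(storage_dir, key)
    with open(path, "wb") as f:
        f.write(data)
    path_table[key] = path


def stream_object(key, buffer_size=BUFFER_SIZE):
    """Yield the object piece by piece so it never has to fit in memory at once."""
    with open(path_table[key], "rb") as f:
        while True:
            piece = f.read(buffer_size)
            if not piece:
                break
            yield piece


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        put_by_reference("video-1", b"x" * 200_000, d)
        # Only one buffer's worth of data is held at a time while summing.
        print(sum(len(piece) for piece in stream_object("video-1")))
```
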

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
