hbase-user mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: HBase region size
Date Fri, 01 Jul 2011 16:34:33 GMT
> > One reasonable way to handle native storage of large objects in HBase would 
> > be to introduce a layer of indirection.
> 
> Do you see this layer on the client or on the server side?


Client side.

> I was also thinking about the "update" case: let's say we store a new version of
> the large object which is smaller than the previous one (fewer chunks).
> The previously created chunks will remain until their TimeToLive expires, but
> could potentially be removed sooner. Would the indirection layer be responsible
> for this maintenance?


Yes.
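
A minimal sketch of that maintenance step, in Python for brevity. It assumes the index is an ordered list of chunk row keys; `stale_chunk_keys` is a hypothetical helper name, not an HBase API. The client would issue Deletes for the returned keys once the new index has been written.

```python
def stale_chunk_keys(old_index, new_index):
    """Return chunk keys referenced by the old index but not by the new one.

    Both arguments are lists of chunk row keys (an assumed index layout).
    The order of the old index is preserved and duplicates are dropped.
    """
    still_needed = set(new_index)
    seen = set()
    stale = []
    for key in old_index:
        if key not in still_needed and key not in seen:
            seen.add(key)
            stale.append(key)
    return stale


old = ["c1", "c2", "c3", "c4"]
new = ["c1", "c5"]
print(stale_chunk_keys(old, new))
```

If chunk keys are derived from content, two objects can share a chunk, and a key should only be deleted once no other index references it. This sketch does not track that.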

> > Store the chunks in a manner that gets good distribution in the keyspace, 
> > maybe by SHA-1 hash of the content.
> 
> An alternative would be to add a "_chunk#" to the original key value.
> I guess you prefer to randomly distribute the chunks in the available 
> regions?


Yes. This will increase the probability that a MultiAction<Get> of the chunks is parallelized
over multiple region servers, which helps distribute load. Conversely, if most or all of the
chunks are in the same region -- as would be the case with appending "_chunk#" to the key --
then performance will suffer because they will be retrieved serially.
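
To illustrate the difference, here is a small simulation in Python. The split points, the object key "video42", and the helper names are all made up for the example; a real table's region boundaries depend on how it was split.

```python
import bisect
import hashlib

# Hypothetical region start keys for a table pre-split into four regions.
SPLIT_POINTS = ["4", "8", "c"]


def region_for(row_key):
    """Index of the region whose [start, end) key range contains row_key."""
    return bisect.bisect_right(SPLIT_POINTS, row_key)


def suffix_keys(object_key, n_chunks):
    """Chunk keys built by appending "_chunk#" to the object's key."""
    return ["%s_chunk%d" % (object_key, i) for i in range(n_chunks)]


def sha1_keys(chunks):
    """Chunk keys built from the SHA-1 hex digest of each chunk's content."""
    return [hashlib.sha1(c).hexdigest() for c in chunks]


chunks = [("chunk payload %d" % i).encode() for i in range(16)]
suffix_regions = {region_for(k) for k in suffix_keys("video42", len(chunks))}
hashed_regions = {region_for(k) for k in sha1_keys(chunks)}
print("suffix keys touch regions:", sorted(suffix_regions))
print("SHA-1 keys touch regions:", sorted(hashed_regions))
```

The suffixed keys share a prefix and sort next to each other, so they fall into a single region here. The hashed keys are effectively uniform over the hex keyspace and will almost certainly land in several.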

> With "index", you mean a list of chunk keys?


Yes.
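
A rough sketch of what the client-side layer might look like, in Python, with a plain dict standing in for the HBase table. The function names and the tiny chunk size are assumptions for illustration only; a real client would use Puts for the writes and a batched multi-Get for the chunk reads.

```python
import hashlib

CHUNK_SIZE = 4  # tiny for illustration; real chunks would be far larger


def put_large_object(table, object_key, data, chunk_size=CHUNK_SIZE):
    """Store data as SHA-1-keyed chunks plus an index row under object_key."""
    index = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        chunk_key = hashlib.sha1(chunk).hexdigest()
        table[chunk_key] = chunk      # stand-in for a Put of the chunk
        index.append(chunk_key)
    table[object_key] = index         # the index: an ordered list of chunk keys
    return index


def get_large_object(table, object_key):
    """Read the index row, fetch every chunk it lists, and reassemble."""
    index = table[object_key]
    return b"".join(table[key] for key in index)


table = {}
index = put_large_object(table, "doc-1", b"0123456789")
assert get_large_object(table, "doc-1") == b"0123456789"
print(len(index), "chunks stored")
```

One side effect of content-derived keys is that identical chunks are stored only once, which matters when deciding whether an old chunk can be deleted.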


> > Storing the large ones in HDFS and simply having the pointer in HBase 
> > allows one to benefit from HDFS streaming.
> 
> I was wondering whether a StreamingPut (StreamingGet) has already
> been discussed?


The way HBase RPC currently works, it's not possible to stream data out of HBase. The objects
that satisfy your Get or Scanner.next request are marshalled fully into the RPC response,
which is sent all at once.

You could use the HBase REST gateway and stream the response from there. In that case your
client-side access to the HBase cluster is via your favorite HTTP client library. But then
your actions transit a gateway, which adds latency (and the gateway must still buffer the
objects fully in memory), and if you address resources in a RESTful manner there are HTTP
transaction overheads to consider. This type of configuration works best for supporting
user-facing services that are RESTful in nature themselves: API services, websites.
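
For example, a client could read a cell through the gateway in fixed-size pieces instead of holding the whole value in memory. The sketch below is Python with only the standard library. The URL layout (table/row/column with an "Accept: application/octet-stream" header) is my understanding of the REST gateway and should be checked against your version; the function names are hypothetical.

```python
import io
import urllib.parse
import urllib.request


def open_cell(base_url, table, row, column):
    """Open one cell via the REST gateway (assumed URL layout, see above)."""
    path = "/".join(urllib.parse.quote(part, safe=":") for part in (table, row, column))
    request = urllib.request.Request(
        "%s/%s" % (base_url.rstrip("/"), path),
        headers={"Accept": "application/octet-stream"},
    )
    return urllib.request.urlopen(request)


def stream_to(response, sink, buffer_size=64 * 1024):
    """Copy a file-like response to sink piece by piece; return bytes copied."""
    copied = 0
    while True:
        piece = response.read(buffer_size)
        if not piece:
            break
        sink.write(piece)
        copied += len(piece)
    return copied


# Local demonstration with an in-memory stand-in for the HTTP response.
fake_response = io.BytesIO(b"x" * 200000)
sink = io.BytesIO()
print(stream_to(fake_response, sink), "bytes copied")
```

Against a real gateway, the call would look like stream_to(open_cell("http://gateway.example:8080", "mytable", "row1", "cf:data"), out_file), where the host, port, table, and column are placeholders.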


Best regards,


  - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

