hbase-user mailing list archives

From Andrew Nguyen <andrew-lists-hb...@ucsfcti.org>
Subject Re: HBase minimum block size for sequential access
Date Tue, 27 Jul 2010 17:09:07 GMT

Thanks for the heads up.  Do you know what happens if I set this value larger than 5MB?  We
will always be scanning the data, and always in large blocks.  I have yet to calculate the
typical size of a single scan but imagine that it will usually be larger than 1MB.

Also, is there any way to change the block size for data already in HBase?  Our current import
process is very slow (because of the preprocessing of the data) and we don't have the resources
to store the preprocessed data, so re-importing everything would be painful.
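For what it's worth, a sketch of how this might be done without re-importing (I haven't verified this on your version; table and family names below are placeholders): the BLOCKSIZE attribute can be altered on an existing column family from the HBase shell, and since the new value only applies to HFiles written afterwards, a major compaction is needed to rewrite the existing data with the new block size.

```
# HBase shell sketch -- 'mytable' and 'cf' are placeholder names
disable 'mytable'
alter 'mytable', {NAME => 'cf', BLOCKSIZE => '1048576'}   # 1MB, new files only
enable 'mytable'
major_compact 'mytable'   # rewrites existing HFiles, applying the new block size
```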


On Jul 27, 2010, at 9:30 AM, Jean-Daniel Cryans wrote:

> Ryan (who wrote HFile) did a lot of testing around block size and
> didn't really see any difference when changing it. So I would
> recommend that you benchmark different values with your own data/usage
> pattern and see whether you get better or worse performance.
> The tradeoff for larger values is that in order to retrieve a single
> cell, you have to fetch a lot more data than required: e.g. if your
> cell is 5KB and your block size is 1MB, that's how much you need to
> pull over the network in order to read it. Obviously, if you are scanning,
> then you probably want all that data anyway, so larger values
> *theoretically* give you better performance.
> J-D
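As a back-of-the-envelope illustration of J-D's point (the numbers are the ones from his example, not measurements): a random read of one cell must fetch the entire block containing it, so the overhead grows with block size.

```python
# Illustrative numbers from the example above: a 5 KB cell in a 1 MB block.
CELL_SIZE_BYTES = 5 * 1024
BLOCK_SIZE_BYTES = 1 * 1024 * 1024

# A single random read must fetch the whole block that holds the cell.
read_amplification = BLOCK_SIZE_BYTES / CELL_SIZE_BYTES
print(f"bytes fetched to read one 5 KB cell: {BLOCK_SIZE_BYTES}")
print(f"read amplification: {read_amplification:.1f}x")
```

For a scan that consumes most of each block anyway, that same 1 MB fetch is not wasted, which is why larger blocks favor sequential access.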
> On Mon, Jul 26, 2010 at 10:41 PM, Andrew Nguyen
> <andrew-lists-hbase@ucsfcti.org> wrote:
>> I found the following snippet in the HFile javadocs and had some questions seeking
>> clarification.  The recommendation is a minimum block size between 8KB and 1MB, with larger
>> values for sequential access.  Our data are time series data (high resolution, sampled at 125Hz).
>> The primary/typical access pattern is retrieval of subsets of the data, anywhere from 37k
>> points to millions of points.
>> Should I be setting this to 1MB?  Would even larger values be a good idea (i.e., greater
>> than 1MB)?  What are the tradeoffs for larger values?
>> From the HFile javadocs:
>> Minimum block size. We recommend a setting of minimum block size between 8KB to 1MB
>> for general usage. Larger block size is preferred if files are primarily for sequential access.
>> However, it would lead to inefficient random access (because there are more data to decompress).
>> Smaller blocks are good for random access, but require more memory to hold the block index,
>> and may be slower to create (because we must flush the compressor stream at the conclusion
>> of each data block, which leads to an FS I/O flush). Further, due to the internal caching
>> in Compression codec, the smallest possible block size would be around 20KB-30KB.
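A rough sketch of the block-index cost the javadoc mentions (my own arithmetic, with an assumed 1 GB HFile; real index entries also carry keys, so actual memory use is higher): the index holds one entry per block, so shrinking the block size grows the index proportionally.

```python
# Assumed file size for illustration -- not a number from the javadoc.
FILE_SIZE_BYTES = 1 * 1024**3  # 1 GB HFile

# One block-index entry per data block: smaller blocks => bigger index.
for block_size_kb in (8, 64, 1024):
    n_blocks = FILE_SIZE_BYTES // (block_size_kb * 1024)
    print(f"{block_size_kb:>5} KB blocks -> {n_blocks:>7} index entries")
```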
>> Thanks!
>> --Andrew
