hbase-user mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: Performance of reading rows with a large number of columns
Date Sat, 03 Apr 2010 17:45:46 GMT
Sammy,

> Is HBase deserializing the entire row when it reads the
> data from disk

No.

> so limiting the column doesn't have any effect. 

HBase is a column oriented store -- values are grouped independently at the store level by
column family. 

It appears you are using only one column family, "cf1:". You need to distinguish between column
families and columns (family:qualifier). When processing a row, all values in the column family
(every family:qualifier in that family) will be read in from disk.

To get the effect you desire, you need a schema more like:

  table:row1: {
   columnfamily:cf1, column:value0001-0100: <cell value>,
   columnfamily:cf2, column:value0101-0200: <cell value>,
   columnfamily:cf3, column:value0201-0300: <cell value>,
   ....
  }

Then you can specify in gets and scans which family: or family:qualifier to include in the
result, and I/O will be performed only on the column families selected by your get or scan.
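
To make the difference concrete, here is a toy model in Python. It is only a sketch of the
behavior described above, not HBase code and not the HBase or Thrift API: the ToyTable class,
the build helper, and the qualifier names (q0000, q0001, ...) are all made up for illustration.
It assumes one store per column family and that reading any column pulls in every cell of that
family for the row.

```python
from collections import defaultdict


class ToyTable:
    """Toy model (hypothetical, not HBase): one store per column family."""

    def __init__(self):
        # family -> row key -> {qualifier: value}
        self.stores = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value):
        family, qualifier = column.split(":", 1)
        self.stores[family][row][qualifier] = value

    def get(self, row, column):
        """Return (value, cells_read) for a single family:qualifier."""
        family, qualifier = column.split(":", 1)
        # Modeling assumption: the whole family is read for this row,
        # even though only one qualifier was requested.
        cells = self.stores[family][row]
        return cells.get(qualifier), len(cells)


def build(num_columns, columns_per_family):
    """Fill row1 with num_columns cells spread over families cf1, cf2, ..."""
    table = ToyTable()
    for i in range(num_columns):
        family = f"cf{i // columns_per_family + 1}"
        table.put("row1", f"{family}:q{i:04d}", i)
    return table


single = build(5000, 5000)  # everything in cf1, as in the original schema
split = build(5000, 100)    # 100 columns per family, as suggested above

value_a, read_a = single.get("row1", "cf1:q0150")
value_b, read_b = split.get("row1", "cf2:q0150")
print(read_a, read_b)  # 5000 100
```

In this model both lookups return the same value, but the single-family layout touches all
5,000 cells of the row while the split layout touches only the 100 cells in cf2.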


Hope that helps,

   - Andy


> From: Sammy Yu
> Subject: Performance of reading rows with a large number of columns
> To: hbase-user@hadoop.apache.org
> Date: Saturday, April 3, 2010, 12:41 AM
> Hi,
>    We've been doing some performance
> comparison between different sets of
> schema on HBase-0.20.3.  I have a schema defined as
> such
> 
> table:row1: {
>    columnfamily:cf1, column:value0001-0100: <cell value>,
>    columnfamily:cf1, column:value0101-0200: <cell value>,
>    columnfamily:cf1, column:value0201-0300: <cell value>,
>    ....
> }
> 
> Using the thrift protocol, we are using scannerOpen and
> limiting it by specifying just a single column such as
> cf1:value0101-0200.  This works
> really well when row1 just has a single column (0.040
> seconds).  However
> when a row contains 5,000 columns, the query time jumps up
> to 1.8 seconds.
> Is HBase deserializing the entire row when it reads the
> data from disk so
> limiting the column doesn't have any effect?  Also, is
> the solution then
> to move the column so that it becomes part of the
> key?  I think this
> solution will work, however it doesn't feel right as there
> could be cases
> where I want value0101-0200 and value0101-0200 to come back
> in one row.
> 
> Thanks,
> Sammy
> 


      

