hbase-user mailing list archives

From Jonathan Gray <jg...@facebook.com>
Subject Re: Performance of reading rows with a large number of columns
Date Sun, 04 Apr 2010 15:20:37 GMT
It's likely not the deserialization itself but rather the time it takes
to read the entire row from HDFS.  There are some optimizations that
can be made here (using the block index to fetch all blocks for a row with a
single HDFS read, TCP socket reuse, etc.).
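
As a rough illustration of the block-index idea mentioned above, here is a toy sketch. It is not HBase's actual HFile code: the names `blocks_for_row` and `single_read_plan`, the list-of-first-keys index, and the fixed 64 KB block size are all assumptions made for the example. It looks up which blocks could hold a row and builds one contiguous read range instead of one read per block.

```python
from bisect import bisect_left, bisect_right

BLOCK_SIZE = 64 * 1024  # assumed fixed block size for this toy model


def blocks_for_row(first_rows, row):
    """Return (first, last) indices of the blocks that may hold cells of `row`.

    first_rows is the sorted list of the first row key in each block,
    a simplified stand-in for a block index (hypothetical structure).
    """
    # The block just before the first one starting at or after `row`
    # may still hold the beginning of the row.
    first = max(bisect_left(first_rows, row) - 1, 0)
    # The last block whose first key is <= `row` may hold the end of the row.
    last = max(bisect_right(first_rows, row) - 1, first)
    return first, last


def single_read_plan(first_rows, row):
    """Return one (offset, length) byte range covering every candidate block."""
    first, last = blocks_for_row(first_rows, row)
    return first * BLOCK_SIZE, (last - first + 1) * BLOCK_SIZE


index = ["r1", "r2", "r2", "r2", "r3"]
print(blocks_for_row(index, "r2"))
print(single_read_plan(index, "r2"))
```

In this model a wide row spread over several blocks costs one ranged read rather than one read per block, which is the saving the optimization is after.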

On Apr 3, 2010, at 11:35 AM, "Sammy Yu" <syu@brightedge.com> wrote:

> Hi Andy,
>   Thanks for the response.  I realize that the data is serialized at the
> (row, column family) level, but ideally I would like to have a single column
> family representing a sorted list of items (up to a million), so the schema
> would look like this:
>
> table: products
> column family name:sortedbyprice
> row key: storea:YYYYMM
>   sortedbyprice:value0001-0100: {product1, product2, product3, ....}
>   sortedbyprice:value0100-0101: {product101, product102, product103, ....}
>   ....
>   sortedbyprice:value5000-5101: {product5000, product5102, product5103, ....}
>
>
> It seems like a viable solution would be to move the column qualifier
> (value0001-0100) into the row key.  However, there still seems to be a
> performance penalty with this approach when a query has to cross multiple
> rows, albeit a much smaller one.  Are there other possible schemas that
> might be more suitable?  Also, is the jump in query time purely related to
> deserialization?
>
> Best,
> Sammy
>
>
> On Sat, Apr 3, 2010 at 10:45 AM, Andrew Purtell <apurtell@apache.org> wrote:
>
>> Sammy,
>>
>>> Is HBase deserializing the entire row when it reads the
>>> data from disk
>>
>> No.
>>
>>> so limiting the column doesn't have any effect.
>>
>> HBase is a column-oriented store -- values are grouped independently at the
>> store level by column family.
>>
>> It appears you are using only one column family, "cf1:". You need to
>> distinguish between column families and columns (family:qualifier).  When
>> processing rows, all values in the column family (every family:qualifier in
>> the family) will be read in from disk.
>>
>> To get the effect you desire, you need a schema more like:
>>
>> table:row1: {
>>   columnfamily:cf1, column:value0001-0100: <cell value>,
>>   columnfamily:cf2, column:value0101-0200: <cell value>,
>>   columnfamily:cf3, column:value0201-0300: <cell value>,
>>  ....
>> }
>>
>> Then you can specify in gets and scans which family: or family:qualifier to
>> include in the result, and I/O will be performed only on the column families
>> selected by your get or scan.
>>
>> Hope that helps,
>>
>>  - Andy
>>
>>
>>> From: Sammy Yu
>>> Subject: Performance of reading rows with a large number of columns
>>> To: hbase-user@hadoop.apache.org
>>> Date: Saturday, April 3, 2010, 12:41 AM
>>> Hi,
>>>   We've been doing some performance comparisons between different
>>> schemas on HBase-0.20.3.  I have a schema defined as follows:
>>>
>>> table:row1: {
>>>   columnfamily:cf1, column:value0001-0100: <cell value>,
>>>   columnfamily:cf1, column:value0101-0200: <cell value>,
>>>   columnfamily:cf1, column:value0201-0300: <cell value>,
>>>   ....
>>> }
>>>
>>> Using the Thrift protocol, we call scannerOpen and limit it by specifying
>>> just a single column, such as cf1:value0101-0200.  This works really well
>>> when row1 has just a single column (0.040 seconds).  However, when a row
>>> contains 5,000 columns, the query time jumps to 1.8 seconds.  Is HBase
>>> deserializing the entire row when it reads the data from disk, so that
>>> limiting the column doesn't have any effect?  Also, is the solution then
>>> to move the column so that it becomes part of the key?  I think this
>>> solution will work; however, it doesn't feel right, as there could be cases
>>> where I want value0101-0200 and value0101-0200 to come back in one row.
>>>
>>> Thanks,
>>> Sammy

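The row-key alternative discussed in the thread (moving the value range out of the column qualifier and into the row key) can be sketched as follows. This is a toy model over a sorted list of keys, not the HBase client API: `make_key`, `scan`, and the `store:YYYYMM:bucket` key layout are hypothetical names chosen for the example, and the half-open [start_row, stop_row) range is an assumption of the model.

```python
from bisect import bisect_left


def make_key(store, yyyymm, bucket):
    """Build a composite row key such as 'storea:201004:value0001-0100'."""
    return f"{store}:{yyyymm}:{bucket}"


def scan(sorted_keys, start_row, stop_row):
    """Return the keys in the half-open range [start_row, stop_row)."""
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]


keys = sorted([
    make_key("storea", "201004", "value0001-0100"),
    make_key("storea", "201004", "value0101-0200"),
    make_key("storea", "201004", "value0201-0300"),
    make_key("storeb", "201004", "value0001-0100"),
])

# Fetch two adjacent buckets for storea with one range scan.
for key in scan(keys, "storea:201004:value0101", "storea:201004:value0301"):
    print(key)
```

Because the keys sort lexicographically, adjacent buckets for the same store and month sit next to each other, so a query that needs two buckets becomes a short range scan across rows rather than a read of one very wide row.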