hbase-user mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: data duplicate?
Date Fri, 28 Nov 2008 09:50:04 GMT
Chu,

No uniqueness test is performed when data is stored into a
cell. If your schema allows multiple versions and you store
the same data into the cell more than once at different
times, each store is kept as a separate version with its own
timestamp, and queries that request multiple versions return
"duplicates" such as the ones you presented.

If you are trying to avoid duplicates, use a row key that
uniquely identifies an object (such as a SHA-1 hash) and
set MAX_VERSIONS to 1 on the column that should contain
only one canonical entry. Then, if you store the same data
item more than once, the existing entry is replaced instead
of a new one being added.
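
To make the difference concrete, here is a minimal sketch in plain
Python. It is a toy model, not the HBase client API: ToyCell, row_key
and store are hypothetical names, and the timestamps 1 and 2 are made
up. It assumes the row key is the SHA-1 hex digest of the stored value
and only imitates the "keep at most N versions" behaviour described
above.

```python
import hashlib


def row_key(value: bytes) -> str:
    """Hypothetical helper: derive the row key from the value (SHA-1 hex digest)."""
    return hashlib.sha1(value).hexdigest()


class ToyCell:
    """Toy stand-in for one versioned cell. This is NOT the HBase API."""

    def __init__(self, max_versions: int):
        self.max_versions = max_versions
        self.versions = []  # (timestamp, value) pairs, newest first

    def put(self, timestamp: int, value: bytes) -> None:
        # No uniqueness check on the value: identical data is stored again.
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda tv: tv[0], reverse=True)
        # Keep only the newest max_versions entries.
        del self.versions[self.max_versions:]


def store(table: dict, max_versions: int, timestamp: int, value: bytes) -> None:
    """Store value under a row key derived from the value itself."""
    cell = table.setdefault(row_key(value), ToyCell(max_versions))
    cell.put(timestamp, value)


path = b"/media/streetimage/processed/streettester/2008_08_07_12_26_21_C/2265.jpg"

# Many versions allowed: storing the same value twice yields two entries.
many = {}
store(many, 30000, 1, path)
store(many, 30000, 2, path)

# One version allowed: the second store replaces the first.
one = {}
store(one, 1, 1, path)
store(one, 1, 2, path)

print(len(many[row_key(path)].versions))  # 2
print(len(one[row_key(path)].versions))   # 1
```

Because the key is derived from the value, the same item always lands
in the same row, and with a single version kept the later store simply
overwrites the earlier one.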

Hope this helps,

   - Andy

> From: 鞠適存 <chihchun.chu@gmail.com>
> Subject: data duplicate?
> To: hbase-user@hadoop.apache.org
> Date: Thursday, November 27, 2008, 7:31 PM
> Hi,
> 
> I revised the "Bulk Import" sample code written by Allen Day
> to upload a flat data file to an HBase table.
> My table schema is designed as:
> <row key> <ColFamily1:colKey> <ColFamily2:colkey>.
> The table description reported by the hbase shell is as follows:
> {NAME => 'ATCGeo', IS_ROOT => 'false', IS_META => 'false', FAMILIES =>
>  [{NAME => 'photo_id', BLOOMFILTER => 'false', VERSIONS => '30000',
>    COMPRESSION => 'NONE', LENGTH => '2147483647', TTL => '-1',
>    IN_MEMORY => 'true', BLOCKCACHE => 'true'},
>   {NAME => 'trail_id', BLOOMFILTER => 'false', VERSIONS => '30000',
>    COMPRESSION => 'NONE', LENGTH => '2147483647', TTL => '-1',
>    IN_MEMORY => 'true', BLOCKCACHE => 'true'}]}
> 
> Some of the data was found to be duplicated: entries with the
> same content but different timestamps. For example, when I use
> get '<table>', '<rowkey>',{COLUMN=>'col1',VERSION=>30000}
> the results are:
> timestamp=3090896685592411,
> value=/media/streetimage/processed/streettester/2008_08_07_12_26_21_C/2265.jpg
> 
> timestamp=3090896682597411,
> value=/media/streetimage/processed/streettester/2008_08_07_12_26_21_C/2264.jpg
> 
> timestamp=3090731558521386,
> value=/media/streetimage/processed/streettester/2008_08_07_12_26_21_C/2265.jpg
> 
> timestamp=3090731556503386,
> value=/media/streetimage/processed/streettester/2008_08_07_12_26_21_C/2264.jpg
> 
> I am sure that the data in the original file is unique. Could
> anyone tell me what the possible reasons are?
> I would appreciate any help!
> 
> Chu


      
