hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: Using HBase for Deduping
Date Fri, 15 Feb 2013 12:38:33 GMT
But then he can't trigger an event if it's a net new row. 

Methinks that he needs to better define the problem he is trying to solve. 
Also the number of events: a billion an hour, or 300K events a second? (OK, it's 277.78K events
a second.) 
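For the net-new-row case: HBase's checkAndPut (passing a null expected value to assert the cell is absent) does the existence check and the write in one atomic server-side call, and its boolean return tells you whether the row was net new. Here is a minimal Python sketch of those semantics only, using an in-memory dict in place of a real table (ToyTable and its method names are illustrative, not the HBase client API):

```python
# Toy, in-memory stand-in for an HBase table. check_and_put mimics
# checkAndPut with a null expected value: write the row only if it does
# not exist yet, and report whether the row was net new.
class ToyTable:
    def __init__(self):
        self.rows = {}

    def check_and_put(self, row_key, value):
        """Atomically insert; return True only if row_key was absent."""
        if row_key in self.rows:
            return False  # duplicate: leave the existing value alone
        self.rows[row_key] = value
        return True       # net new row: caller can trigger its event


table = ToyTable()
first = table.check_and_put("uuid-123", b"event-payload")   # net new row
second = table.check_and_put("uuid-123", b"duplicate")      # duplicate

print(first, second)  # True False
```

The boolean result is the piece the plain max-versions trick cannot give you: it lets the MR job emit the event exactly once, on the first arrival.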

On Feb 14, 2013, at 10:19 PM, Anoop Sam John <anoopsj@huawei.com> wrote:

> When max versions is set to 1 and a duplicate key is added, the last added will win,
> removing the old. Is this what you want, Rahul? I think from his explanation he needs
> the reverse.
> -Anoop-
> ________________________________________
> From: Asaf Mesika [asaf.mesika@gmail.com]
> Sent: Friday, February 15, 2013 3:56 AM
> To: user@hbase.apache.org; Rahul Ravindran
> Subject: Re: Using HBase for Deduping
> You can load the events into an HBase table which has the event id as the
> unique row key. You can set max versions to 1 on the column family, thus
> letting HBase get rid of the duplicates for you during major compaction.
> On Thursday, February 14, 2013, Rahul Ravindran wrote:
>> Hi,
>>   We have events which are delivered into our HDFS cluster which may be
>> duplicated. Each event has a UUID, and we were hoping to leverage HBase to
>> dedupe them. We run a MapReduce job which would perform a lookup for each
>> UUID in HBase and then emit the event only if the UUID was absent, and would
>> also insert into the HBase table (this is simplistic; I am omitting
>> details that make this more resilient to failures). My concern is that doing
>> a Read+Write for every event in MR would be slow (we expect around 1
>> billion events every hour). Does anyone use HBase for a similar use case, or
>> is there a different approach to achieving the same end result? Any
>> information or comments would be great.
>> Thanks,
>> ~Rahul.
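To see why the max-versions-1 trick may be the reverse of what is wanted: duplicates do vanish at major compaction, but the *last* write survives, so nothing tells you which copy arrived first or lets you fire an event on first arrival. A toy Python model of a versioned cell illustrates the behavior (compact and max_versions here are illustrative names, not the HBase API; real compaction happens server-side):

```python
# Toy model of one HBase cell: each put appends a (timestamp, value)
# version; "compaction" with max versions = 1 keeps only the newest.
def compact(versions, max_versions=1):
    """Keep only the newest max_versions entries, as a major compaction would."""
    return sorted(versions, key=lambda v: v[0], reverse=True)[:max_versions]


cell = []
cell.append((1000, b"first-arrival"))
cell.append((2000, b"duplicate-arrival"))

cell = compact(cell, max_versions=1)
print(cell)  # [(2000, b'duplicate-arrival')] -- the last write wins
```

So the table ends up deduped, but with last-write-wins semantics, and the win happens lazily at compaction time rather than at write time.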
