hbase-user mailing list archives

From Anoop Sam John <anoo...@huawei.com>
Subject RE: Using Hbase for Dedupping
Date Fri, 15 Feb 2013 03:55:40 GMT
Hi Rahul
When you say that some events can come with a duplicate UUID, what is the probability
of such duplicate events? Is it like most of the events won't be unique and only a few are duplicates?
Also, do the same duplicated events come again and again (I mean same UUID for so many

From: Rahul Ravindran [rahulrv@yahoo.com]
Sent: Friday, February 15, 2013 12:53 AM
To: user@hbase.apache.org
Subject: Using Hbase for Dedupping

We have events delivered into our HDFS cluster which may be duplicated. Each
event has a UUID, and we were hoping to leverage HBase to dedupe them. We run a MapReduce job
which would perform a lookup for each UUID in HBase, emit the event only if the UUID
was absent, and insert the UUID into the HBase table. (This is simplistic; I am leaving out
details that make this more resilient to failures.) My concern is that doing a read plus a write for
every event in MR would be slow (we expect around 1 billion events every hour). Does anyone
use HBase for a similar use case, or is there a different approach to achieving the same end
result? Any information or comments would be great.
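
For illustration only, the lookup-then-insert flow described above might be sketched as follows. This is not the poster's code: SeenStore is a hypothetical in-memory stand-in for the HBase table, and put_if_absent and dedupe are names made up for this sketch. The idea is to fold the read and the write into one atomic "insert if absent" step; the HBase client API has a conditional write (checkAndPut) that could plausibly play that role, though whether it keeps up with the stated volume is an open question here.

```python
import threading


class SeenStore:
    """Hypothetical in-memory stand-in for an HBase table keyed by event UUID."""

    def __init__(self):
        self._rows = set()
        self._lock = threading.Lock()

    def put_if_absent(self, uuid):
        """Atomically record uuid; return True only the first time it is seen."""
        with self._lock:
            if uuid in self._rows:
                return False
            self._rows.add(uuid)
            return True


def dedupe(events, store):
    """Yield each (uuid, payload) event whose UUID was not already in the store."""
    for uuid, payload in events:
        if store.put_if_absent(uuid):
            yield uuid, payload


# Rate implied by "around 1 billion events every hour".
EVENTS_PER_HOUR = 1_000_000_000
events_per_second = EVENTS_PER_HOUR / 3600  # roughly 278,000 per second
```

Usage: list(dedupe([("a", 1), ("b", 2), ("a", 3)], SeenStore())) keeps only the first event for UUID "a". The per-second figure shows why a round trip per event is a concern at this volume.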
