hbase-user mailing list archives

From Viral Bajaria <viral.baja...@gmail.com>
Subject RE: Using HBase for Deduping
Date Thu, 14 Feb 2013 20:19:00 GMT
Are all these dupe events expected to fall within the same hour, or
can they be spread over multiple hours?

Viral
From: Rahul Ravindran
Sent: 2/14/2013 11:41 AM
To: user@hbase.apache.org
Subject: Using HBase for Deduping
Hi,
   We have events delivered into our HDFS cluster that may be
duplicated. Each event has a UUID, and we were hoping to leverage
HBase to dedupe them. We would run a MapReduce job that looks up each
UUID in HBase, emits the event only if the UUID is absent, and then
inserts the UUID into the HBase table. (This is simplistic; I am
leaving out the details that would make it more resilient to
failures.) My concern is that doing a read plus a write for every
event in MR would be slow (we expect around 1 billion events every
hour). Does anyone use HBase for a similar use case, or is there a
different approach to achieving the same end result? Any information
or comments would be great.

Thanks,
~Rahul.
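
For illustration only, the lookup-then-insert flow described above might be sketched roughly as follows. This is not the poster's job and not HBase client code: `SeenTable` is a hypothetical in-memory stand-in for the HBase table of seen UUIDs, `put_if_absent` is an assumed helper that folds the read and the write into a single call, and the `(uuid, payload)` event shape is invented for the example.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: (uuid, payload). Not taken from the original post.
Event = Tuple[str, str]


class SeenTable:
    """In-memory stand-in for the HBase table of seen UUIDs (an assumption)."""

    def __init__(self) -> None:
        self._rows: Dict[str, bool] = {}

    def put_if_absent(self, uuid: str) -> bool:
        """Record the UUID; return True only if it was not already present."""
        if uuid in self._rows:
            return False
        self._rows[uuid] = True
        return True


def dedupe(events: Iterable[Event], table: SeenTable) -> Iterator[Event]:
    """Emit each event only the first time its UUID is seen."""
    for uuid, payload in events:
        # One combined check-and-insert per event instead of a separate
        # read followed by a write.
        if table.put_if_absent(uuid):
            yield (uuid, payload)


if __name__ == "__main__":
    table = SeenTable()
    batch = [("a", "first"), ("b", "second"), ("a", "duplicate")]
    for event in dedupe(batch, table):
        print(event)
```

Because the table persists between calls, a duplicate arriving in a later batch is dropped as well, which is the case the reply's question about multiple hours is probing. At the stated rate of about 1 billion events per hour, the store would need to absorb roughly 278,000 such operations per second.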
