hbase-user mailing list archives

From Rahul Ravindran <rahu...@yahoo.com>
Subject Using Hbase for Dedupping
Date Thu, 14 Feb 2013 19:23:41 GMT
   We have events delivered into our HDFS cluster, some of which may be duplicates. Each
event has a UUID, and we were hoping to use HBase to dedupe them. We would run a MapReduce job
that looks up each UUID in HBase, emits the event only if the UUID is absent, and then
inserts the UUID into the HBase table. (This is simplified; I am leaving out the details
that make it more resilient to failures.) My concern is that doing a read plus a write
for every event in MR would be slow (we expect around 1 billion events every hour). Does anyone
use HBase for a similar use case, or is there a different approach to achieving the same end
result? Any information or comments would be great.
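
To make the flow concrete, here is a minimal sketch of the lookup-then-insert logic described above. It is only an illustration: InMemoryUuidTable, put_if_absent and dedupe are hypothetical names, and a Python set stands in for the HBase table. In a real job the lookup and insert would presumably be a single conditional put (HBase offers checkAndPut for that) so that two tasks cannot both emit the same UUID. The last lines work out the throughput that 1 billion events per hour implies.

```python
class InMemoryUuidTable:
    """Hypothetical stand-in for an HBase table keyed by event UUID."""

    def __init__(self):
        self._rows = set()

    def put_if_absent(self, row_key):
        """Insert row_key; return True if it was new, False if already present."""
        if row_key in self._rows:
            return False
        self._rows.add(row_key)
        return True


def dedupe(events, table):
    """Yield only the events whose UUID has not been recorded in the table yet."""
    for event in events:
        # One combined check-and-insert per event, instead of a separate read and write.
        if table.put_if_absent(event["uuid"]):
            yield event


# Rough throughput implied by the expected load (assumes events arrive evenly).
EVENTS_PER_HOUR = 1_000_000_000
required_ops_per_second = EVENTS_PER_HOUR / 3600  # roughly 278 thousand per second
```

For example, passing three events with UUIDs "a", "b", "a" through dedupe would emit only the first two. The sketch ignores failure handling, such as a task that records a UUID and then dies before its output is committed.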
