hbase-user mailing list archives

From Anoop Sam John <anoo...@huawei.com>
Subject RE: Using HBase for Deduping
Date Fri, 15 Feb 2013 09:43:34 GMT
Or maybe go with a large value for max versions and put the duplicate entries. Then, during
compaction, you need a wrapper around InternalScanner whose next() method returns only the
first KV for a rowkey and drops the others. The same kind of logic will also be needed at scan
time. This should be good enough IMO, especially when there won't be many duplicate events for
the same rowkey. That is why I asked some questions before.

I think this solution can be checked.
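
As a rough illustration of the idea above, here is a plain-Python model (not HBase code, and it calls no HBase API). Each rowkey holds several versions, and both a compaction-style pass and a scan-style pass expose only one of them. The function names and the `(timestamp, value)` layout are hypothetical, and the sketch assumes the goal is to keep the earliest-written entry per rowkey, as the quoted discussion below suggests.

```python
from typing import Dict, Iterator, List, Tuple

Version = Tuple[int, str]  # (write timestamp, value); layout assumed for illustration


def compact_keep_first(table: Dict[str, List[Version]]) -> Dict[str, List[Version]]:
    """Model of a compaction pass that keeps only the earliest version per rowkey."""
    compacted = {}
    for rowkey, versions in table.items():
        # Assumption: the smallest timestamp identifies the first write.
        compacted[rowkey] = [min(versions, key=lambda v: v[0])]
    return compacted


def scan_keep_first(table: Dict[str, List[Version]]) -> Iterator[Tuple[str, Version]]:
    """Scan-time equivalent: yield only the earliest version of each row,
    so duplicates that have not been compacted away yet stay hidden."""
    for rowkey in sorted(table):
        yield rowkey, min(table[rowkey], key=lambda v: v[0])


table = {
    "uuid-1": [(100, "first copy"), (250, "duplicate")],
    "uuid-2": [(180, "only copy")],
}
compacted = compact_keep_first(table)
```

The scan-side function matters because, until a compaction runs, the table still physically contains the duplicates.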

From: Asaf Mesika [asaf.mesika@gmail.com]
Sent: Friday, February 15, 2013 3:06 PM
To: user@hbase.apache.org
Cc: Rahul Ravindran
Subject: Re: Using HBase for Deduping

Then maybe he can place each event in the same rowkey, but with a column
qualifier that is the timestamp of the event saved as a long. Upon preCompact
in a region observer, he can filter out, for any row, all columns but the first?
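
A minimal sketch of this layout, again as a plain-Python model and not a region observer implementation. It assumes the timestamp is non-negative and encoded as an 8-byte big-endian long, so that the byte order of the qualifiers matches the numeric order of the timestamps. `qualifier_for` and `keep_first_column` are hypothetical names.

```python
import struct
from typing import Dict


def qualifier_for(event_ts_millis: int) -> bytes:
    """Encode a non-negative timestamp as an 8-byte big-endian long."""
    return struct.pack(">q", event_ts_millis)


def keep_first_column(row: Dict[bytes, bytes]) -> Dict[bytes, bytes]:
    """Model of the compaction-time filter: drop every column but the first.

    Columns are ordered by the raw bytes of their qualifier; for non-negative
    big-endian longs that order matches numeric timestamp order.
    """
    if not row:
        return {}
    first = min(row)  # bytes objects compare lexicographically
    return {first: row[first]}


row = {
    qualifier_for(1360917814000): b"duplicate delivery",
    qualifier_for(1360900000000): b"original event",
}
kept = keep_first_column(row)
```

With this layout every duplicate lands in its own column, so nothing is overwritten on write and the choice of which copy survives is deferred to the filter.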

On Friday, February 15, 2013, Anoop Sam John wrote:

> When max versions is set to 1 and a duplicate key is added, the last one
> added will win, removing the old one.  Is this what you want, Rahul?  I think
> from his explanation he needs the reverse.
> -Anoop-
> ________________________________________
> From: Asaf Mesika [asaf.mesika@gmail.com]
> Sent: Friday, February 15, 2013 3:56 AM
> To: user@hbase.apache.org; Rahul Ravindran
> Subject: Re: Using HBase for Deduping
> You can load the events into an HBase table which has the event id as the
> unique row key. You can set max versions to 1 on the column family, thus
> letting HBase get rid of the duplicates for you during major compaction.
> On Thursday, February 14, 2013, Rahul Ravindran wrote:
> > Hi,
> >    We have events which are delivered into our HDFS cluster which may be
> > duplicated. Each event has a UUID and we were hoping to leverage HBase to
> > dedupe them. We run a MapReduce job which would perform a lookup for each
> > UUID on HBase and then emit the event only if the UUID was absent, and would
> > also insert into the HBase table (this is simplistic; I am leaving out the
> > details that make this more resilient to failures). My concern is that doing
> > a Read+Write for every event in MR would be slow (we expect around 1
> > billion events every hour). Does anyone use HBase for a similar use case, or
> > is there a different approach to achieving the same end result? Any
> > information or comments would be great.
> >
> > Thanks,
> > ~Rahul.
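
For reference, the lookup-then-insert flow described in the original question can be modelled as follows. This is only a sketch: a Python set stands in for the HBase table, `dedupe` is a hypothetical name, and the model ignores concurrency (in a real job, two tasks could both find a UUID absent before either of them inserts it).

```python
from typing import Iterable, Iterator, Set, Tuple

Event = Tuple[str, str]  # (uuid, payload); layout assumed for illustration


def dedupe(events: Iterable[Event], seen: Set[str]) -> Iterator[Event]:
    """Emit an event only if its UUID is absent from `seen`, then record it."""
    for uuid, payload in events:
        if uuid in seen:   # the "read": one lookup per event
            continue
        seen.add(uuid)     # the "write": one insert per new event
        yield uuid, payload


seen: Set[str] = set()
unique = list(dedupe([("a", "x"), ("b", "y"), ("a", "x")], seen))
```

Every event costs one lookup and every new event costs one insert as well, which is the per-event Read+Write overhead the question is concerned about.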