hbase-user mailing list archives

From Anoop Sam John <anoo...@huawei.com>
Subject RE: bulk deletes
Date Mon, 08 Oct 2012 03:55:58 GMT
We have also implemented compaction-time deletes (dropping the unwanted KVs during
compaction). This works very well for us.
Because that approach delays the deletes until the next major compaction, we also have an
implementation that does real-time bulk deletes. [We have such a use case.]
It uses an endpoint implementation to do the scan and delete on the server side only.
I just raised an issue for this [HBASE-6942] and will post a patch based on the 0.94 model
there. Please have a look. I have seen a big performance improvement over the normal
scan() + delete(List<Delete>) approach, as this avoids many network calls and much traffic.
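
To make the round-trip argument concrete, here is a minimal sketch in plain Java. It is a toy model, not the HBASE-6942 patch and not HBase API: ToyRegion, its rpcCalls counter, and the method names are all hypothetical, and every call into the toy region is assumed to cost exactly one client-server round trip.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.function.Predicate;

// Toy model only: ToyRegion is a hypothetical stand-in for one region, not an HBase class.
final class BulkDeleteSketch {

    static final class ToyRegion {
        final TreeSet<String> rows = new TreeSet<>();
        int rpcCalls = 0; // assumption: each method call below is one round trip

        // Returns up to `limit` row keys strictly after `startAfter` (null = from the start).
        List<String> scan(String startAfter, int limit) {
            rpcCalls++;
            List<String> page = new ArrayList<>();
            for (String row : (startAfter == null ? rows : rows.tailSet(startAfter, false))) {
                if (page.size() == limit) break;
                page.add(row);
            }
            return page;
        }

        // Client-driven batch delete: the row keys travel back to the server.
        void delete(List<String> batch) {
            rpcCalls++;
            rows.removeAll(batch);
        }

        // Endpoint-style call: scan and delete inside the region, return only a count.
        long bulkDelete(Predicate<String> matches) {
            rpcCalls++;
            long before = rows.size();
            rows.removeIf(matches);
            return before - rows.size();
        }
    }

    // The "normal way": page through the rows, then send the deletes back.
    static long clientSideDelete(ToyRegion region, Predicate<String> matches, int batchSize) {
        long deleted = 0;
        String last = null;
        while (true) {
            List<String> page = region.scan(last, batchSize);
            if (page.isEmpty()) break;
            last = page.get(page.size() - 1);
            List<String> doomed = new ArrayList<>();
            for (String row : page) {
                if (matches.test(row)) doomed.add(row);
            }
            if (!doomed.isEmpty()) {
                region.delete(doomed);
                deleted += doomed.size();
            }
        }
        return deleted;
    }

    static ToyRegion newRegion(int rowCount) {
        ToyRegion region = new ToyRegion();
        for (int i = 0; i < rowCount; i++) region.rows.add(String.format("row%04d", i));
        return region;
    }

    public static void main(String[] args) {
        ToyRegion a = newRegion(1000);
        clientSideDelete(a, row -> true, 100);
        ToyRegion b = newRegion(1000);
        b.bulkDelete(row -> true);
        System.out.println("client-side round trips: " + a.rpcCalls);
        System.out.println("endpoint round trips:    " + b.rpcCalls);
    }
}
```

In this model the client-side path pays one round trip per scan page plus one per delete batch, while the endpoint-style call pays one per region. Real numbers depend on scanner caching, batching, and region count, none of which the toy captures.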

-Anoop-
________________________________________
From: lars hofhansl [lhofhansl@yahoo.com]
Sent: Saturday, October 06, 2012 1:09 AM
To: user@hbase.apache.org
Subject: Re: bulk deletes

Does it work? :)

How did you do the deletes before? I assume you used the HTable.delete(List<Delete>)
API?

(Doesn't really help you, but) In 0.92+ you could hook up a coprocessor into the compactions
and simply filter out any KVs you want to have removed.
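
As a rough illustration of the idea (not actual coprocessor code), the sketch below models a compaction as a merge of sorted store files that skips any cell a predicate flags. KeyValue, compact(), and the "user1" rule are hypothetical stand-ins for illustration; a real implementation would plug into HBase's coprocessor hooks instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Toy model only: KeyValue and compact() are hypothetical stand-ins, not HBase classes.
final class CompactionFilterSketch {

    // Minimal stand-in for a cell: row key, column qualifier, timestamp.
    static final class KeyValue {
        final String row;
        final String qualifier;
        final long timestamp;

        KeyValue(String row, String qualifier, long timestamp) {
            this.row = row;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }
    }

    // Merge several store files into one, skipping every cell the predicate flags.
    static List<KeyValue> compact(List<List<KeyValue>> storeFiles, Predicate<KeyValue> shouldDrop) {
        List<KeyValue> merged = new ArrayList<>();
        for (List<KeyValue> file : storeFiles) {
            for (KeyValue kv : file) {
                if (!shouldDrop.test(kv)) merged.add(kv); // dropped cells never reach the new file
            }
        }
        // Keep the output ordered by row, then qualifier, then newest timestamp first.
        merged.sort((x, y) -> {
            int c = x.row.compareTo(y.row);
            if (c != 0) return c;
            c = x.qualifier.compareTo(y.qualifier);
            if (c != 0) return c;
            return Long.compare(y.timestamp, x.timestamp);
        });
        return merged;
    }

    public static void main(String[] args) {
        List<KeyValue> fileA = Arrays.asList(
                new KeyValue("user1", "email", 10L), new KeyValue("user2", "email", 11L));
        List<KeyValue> fileB = Arrays.asList(
                new KeyValue("user1", "email", 20L), new KeyValue("user3", "email", 12L));
        // Hypothetical business rule: remove everything that belongs to user1.
        List<KeyValue> out = compact(Arrays.asList(fileA, fileB), kv -> kv.row.equals("user1"));
        for (KeyValue kv : out) {
            System.out.println(kv.row + "/" + kv.qualifier + "@" + kv.timestamp);
        }
    }
}
```

Note that in this scheme the unwanted cells remain readable until a compaction rewrites the files that contain them.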


-- Lars



________________________________
 From: Paul Mackles <pmackles@adobe.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>
Sent: Friday, October 5, 2012 11:17 AM
Subject: bulk deletes

We need to do deletes pretty regularly and sometimes we could have hundreds of millions of
cells to delete. TTLs won't work for us because we have a fair amount of bizlogic around the
deletes.

Given the current delete implementation (we are on 0.90.4), this delete process can take a really
long time (half a day or more with 100 or so concurrent threads). From everything I can tell,
the performance issues come down to each delete being an individual RPC call (even when using
the batch API). In other words, I don't see any thrashing on hbase while this process is running
– just lots of waiting for the RPC calls to return.

The alternative we came up with is to use the standard bulk load facilities to handle the
deletes. The code turned out to be surprisingly simple and appears to work in the small-scale
tests we have tried so far. Is anyone else doing deletes in this fashion? Are there drawbacks
that I might be missing? Here is a link to the code:

https://gist.github.com/3841437

Pretty simple, eh? I haven't seen much mention of this technique which is why I am a tad paranoid
about it.
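
For context on why this can work at all: HBase stores a delete as a marker cell (a tombstone) rather than removing data in place, so a file of markers can in principle be loaded like any other file of cells. The sketch below is a toy model of that masking behaviour, not the code from the gist and not HBase API. TombstoneSketch, Cell, and load() are hypothetical names, and only a column-level marker is modelled, which is assumed to hide every put with a timestamp at or below its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: these types are hypothetical stand-ins, not HBase classes.
final class TombstoneSketch {

    enum Type { PUT, DELETE_COLUMN }

    static final class Cell {
        final String row;
        final String qualifier;
        final long timestamp;
        final Type type;
        final String value;

        Cell(String row, String qualifier, long timestamp, Type type, String value) {
            this.row = row;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.type = type;
            this.value = value;
        }
    }

    private final List<Cell> cells = new ArrayList<>();

    // Models both a normal write and a bulk-loaded file: cells are simply appended.
    void load(List<Cell> file) {
        cells.addAll(file);
    }

    // Returns the newest visible value for (row, qualifier), or null if masked or absent.
    String get(String row, String qualifier) {
        long newestDelete = Long.MIN_VALUE;
        for (Cell c : cells) {
            if (c.type == Type.DELETE_COLUMN && c.row.equals(row) && c.qualifier.equals(qualifier)) {
                newestDelete = Math.max(newestDelete, c.timestamp);
            }
        }
        Cell best = null;
        for (Cell c : cells) {
            boolean matches = c.type == Type.PUT && c.row.equals(row) && c.qualifier.equals(qualifier);
            // A put is visible only if it is strictly newer than the newest marker.
            if (matches && c.timestamp > newestDelete && (best == null || c.timestamp > best.timestamp)) {
                best = c;
            }
        }
        return best == null ? null : best.value;
    }

    public static void main(String[] args) {
        TombstoneSketch store = new TombstoneSketch();
        store.load(Arrays.asList(new Cell("r1", "q", 100L, Type.PUT, "v1")));
        System.out.println(store.get("r1", "q")); // visible before the marker arrives
        store.load(Arrays.asList(new Cell("r1", "q", 200L, Type.DELETE_COLUMN, null)));
        System.out.println(store.get("r1", "q")); // masked by the loaded marker
        store.load(Arrays.asList(new Cell("r1", "q", 150L, Type.PUT, "late")));
        System.out.println(store.get("r1", "q")); // an older-stamped put is masked too
    }
}
```

One thing the model highlights is that the marker's timestamp matters: a put that arrives later but carries an older timestamp stays hidden for as long as the marker exists.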

Thanks,
Paul