hbase-user mailing list archives

From Juhani Connolly <juh...@ninja.co.jp>
Subject Re: Efficient mass deletes
Date Mon, 05 Apr 2010 06:13:40 GMT
For now it is just something I expect to run into problems with, as I
am still some way from load testing, though I hope to get started on it
soon. The MultiDelete implementation planned for 0.21 will certainly
help a lot, though.
Perhaps running an M/R job with a scan result as the input, where each
task deletes a range, could be an efficient way to do this kind of mass
delete?
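
To make the range-per-task idea concrete, here is a minimal sketch of how
the ranges might be computed. It does not call HBase at all. The key layout
("<id>:<zero-padded epoch millis>"), the class and method names, and the
per-id retention map are all assumptions made for illustration, not
something taken from the actual table design.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DeleteRanges {
    /** Half-open row key range [startKey, stopKey) that one task would delete. */
    public static final class KeyRange {
        public final String startKey;
        public final String stopKey;

        public KeyRange(String startKey, String stopKey) {
            this.startKey = startKey;
            this.stopKey = stopKey;
        }
    }

    // Assumed key layout: id, a ':' separator, then 13-digit zero-padded millis,
    // so that keys for one id sort by time. Ids are assumed not to contain ':'.
    public static String rowKey(String id, long timestampMillis) {
        return String.format("%s:%013d", id, timestampMillis);
    }

    // One range per id, covering everything older than that id's retention period.
    // Ids missing from the map fall back to the default retention.
    public static List<KeyRange> expiredRanges(List<String> ids,
                                               Map<String, Long> retentionMillisById,
                                               long defaultRetentionMillis,
                                               long nowMillis) {
        List<KeyRange> ranges = new ArrayList<>();
        for (String id : ids) {
            long retention = retentionMillisById.getOrDefault(id, defaultRetentionMillis);
            long cutoff = Math.max(0L, nowMillis - retention);
            ranges.add(new KeyRange(rowKey(id, 0L), rowKey(id, cutoff)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        Map<String, Long> retention = new HashMap<>();
        retention.put("vip", 90L * 24 * 60 * 60 * 1000); // hypothetical id kept for 90 days
        long sevenDays = 7L * 24 * 60 * 60 * 1000;       // hypothetical default retention
        List<String> ids = List.of("user1", "vip");
        for (KeyRange r : expiredRanges(ids, retention, sevenDays, System.currentTimeMillis())) {
            System.out.println(r.startKey + " .. " + r.stopKey);
        }
    }
}
```

Each range could then be handed to a separate task, which would scan it and
issue the deletes. Keeping the retention rule per id is what a plain
column-family ttl cannot express.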

On 04/03/2010 01:26 AM, Jonathan Gray wrote:
> Juhani,
>
> Deletes are really special versions of Puts (so they are equally fast).
> I suppose it would be possible to have some kind of special filter that
> issued deletes server-side but seems dangerous :)  That's beyond even the
> notion of stateful scanners which are tricky as is.
>
> MultiDelete would actually process those deletes in parallel, concurrently
> running across all the servers, so is a bit more than just List<Delete>
> under the covers.  Or at least that's the intention, I don't think it's
> built.
>
> Are you running into performance issues doing the deletes currently, or
> are you just expecting to run into problems?  I would think that if it was
> taking too long to run from a sequential client, a parallel MultiDelete
> would solve your problems.
>
> JG
>
>    
>> -----Original Message-----
>> From: Juhani Connolly [mailto:juhani@ninja.co.jp]
>> Sent: Thursday, April 01, 2010 10:44 PM
>> To: hbase-user@hadoop.apache.org
>> Subject: Efficient mass deletes
>>
>> Having an issue with table design regarding how to delete old/obsolete
>> data.
>>
>> I have row names in a non-time sorted manner, id first followed by
>> timestamp, the main objective being running big scans on specific id's
>> from time x to time y.
>>
>> However this data builds up at a respectable rate and I need a method to
>> delete old records en masse. I considered using the ttl parameter on the
>> column families, but the current plan is to selectively store data for a
>> longer time for specific id's.
>>
>> Are there any plans to link a delete operation with a scanner (so delete
>> range x-y, or if you supply a filter, delete when conditions p and q are
>> met)?
>>
>> If not, what would be the recommended method to handle this kind of
>> batch delete?
>> The current JIRA for MultiDelete (
>> http://issues.apache.org/jira/browse/HBASE-1845 )  simply implements
>> deleting on a List<Delete>, which still seems limited.
>>
>> Is the only way to do this to run a scan, and then build a List from
>> that to use with the multi call discussed in HBASE-1845? This feels very
>> inefficient but please correct me if I'm mistaken. Current activity
>> estimate is about 10 million rows a day, generating about 300 million
>> cells, which would need to be deleted on a regular basis (so 300 million
>> cells every day or 2.1 billion once a week).
>>      
>    
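
As a rough check on the volumes quoted above, the sketch below works
through the arithmetic and shows how row keys gathered from a scan could be
split into List-sized batches. The batch size of 1000 and all names here are
assumptions for illustration. It also assumes whole rows are deleted, so one
delete is issued per row rather than per cell.

```java
import java.util.ArrayList;
import java.util.List;

public class DeleteBatches {
    // Number of batches needed to cover `items` deletes (ceiling division).
    public static long batchCount(long items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        return (items + batchSize - 1) / batchSize;
    }

    // Split items (for example, row keys returned by a scan) into fixed-size
    // batches. The last batch may be smaller than batchSize.
    public static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        long rowsPerDay = 10_000_000L;   // figure from the original message
        long cellsPerDay = 300_000_000L; // figure from the original message
        int batchSize = 1000;            // hypothetical batch size

        System.out.println("cells per row: " + (cellsPerDay / rowsPerDay));
        System.out.println("cells per week: " + (cellsPerDay * 7));
        System.out.println("batches per day: " + batchCount(rowsPerDay, batchSize));
    }
}
```

With these figures each row holds about 30 cells, so deleting by row means
roughly 10 million deletes a day instead of 300 million, or 10,000 batches
a day at the assumed batch size.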

