hbase-user mailing list archives

From Alex Baranau <alex.barano...@gmail.com>
Subject Re: Can manually remove HFiles (similar to bulk import, but bulk remove)?
Date Mon, 09 Jul 2012 20:05:19 GMT
Hey, this is closer!

However, I think I'd want to avoid major compactions. In fact, I was thinking
about avoiding any compactions and splits altogether.
E.g. say I process some amount of data every hour (e.g. with an MR job); the
output is written as a set of HFiles and added to be served by HBase. At the
same time, I only need to keep one week of data. In that case, ideally, I'd
like to do the following (sketched in code right after this list):
* pre-split the table into N regions, evenly distributed over the cluster
* turn off minor/major compactions (it is OK for me to have 24*7 HFiles per
region, given one CF, and I know they will not exceed the region max size)
* periodically remove HFiles older than one week
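
For concreteness, here is a minimal sketch of that table setup with the Java
client API (0.94-era). The table/CF names, the number of regions, and the
hex-prefixed split keys are made up for illustration; I'm also assuming that
setting MAX_FILESIZE very high effectively disables splits, and that
compactions are kept from running by raising hbase.hstore.compactionThreshold
in hbase-site.xml (I don't know of a per-table switch):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTableSetup {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Assumption: compactions are effectively disabled cluster-wide by
        // setting hbase.hstore.compactionThreshold very high in hbase-site.xml.
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Hypothetical table "events" with one CF "d" for the hourly MR output.
        HTableDescriptor table = new HTableDescriptor("events");
        HColumnDescriptor cf = new HColumnDescriptor("d");
        cf.setTimeToLive(7 * 24 * 3600); // keep one week of data (TTL in seconds)
        table.addFamily(cf);

        // Effectively disable splitting: regions never hit the max file size.
        table.setMaxFileSize(Long.MAX_VALUE);

        // Pre-split into N regions, assuming row keys start with an even
        // hex prefix so the key space is covered uniformly.
        int n = 16;
        byte[][] splits = new byte[n - 1][];
        for (int i = 1; i < n; i++) {
          splits[i - 1] = Bytes.toBytes(String.format("%02x", i * 256 / n));
        }
        admin.createTable(table, splits);
        admin.close();
      }
    }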

By setting up the table like this, I'd avoid unnecessary split operations,
compactions, and region moves (i.e. avoid redundant IO/CPU and, hopefully,
breaking data locality).

So, you are saying that a major compaction will look at the max/min timestamp
metainfo of each HFile and will remove the whole file based on TTL if
necessary (without going through the file)? Can I tell it not to actually
compact the other HFiles (i.e. leave them as-is; otherwise it would not be as
easy to remove HFiles again an hour later)? I.e. it looks like "delete only
whole HFiles based on TTL" functionality is what I need here.

I fear that the complexity of removing HFiles may come from the block cache,
which may still hold blocks from those files. Is that right? I'm actually OK
with HBase returning data from files I've "deleted" by removing their HFiles:
I will specify a time range on my scans anyway (in this example, to omit
anything older than one week).
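
Something like this, with the Java client API (again using the hypothetical
"events" table; just a sketch):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public class LastWeekScan {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "events");
        long now = System.currentTimeMillis();
        Scan scan = new Scan();
        // Time range is [min, max): only cells written within the last week
        // come back, even if their expired HFiles were not removed yet.
        scan.setTimeRange(now - 7L * 24 * 3600 * 1000, now);
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
          System.out.println(r); // process row
        }
        scanner.close();
        table.close();
      }
    }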

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Solr - Lucene - Hadoop - HBase

On Mon, Jul 9, 2012 at 3:44 PM, Jonathan Hsieh <jon@cloudera.com> wrote:

> You could set your TTLs and trigger a major compaction ...
>
> Or, (this is pretty advanced) you can probably do it without taking down
> RS's by:
> 1) closing the region in the hbase shell
> 2) deleting the file in the shell
> 3) reopening the region in the hbase shell
>
> Jon.
>
> On Mon, Jul 9, 2012 at 12:41 PM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
>
> > Heh, this is what I want to avoid actually: restarting RSs.
> >
> > Alex Baranau
> > ------
> > Sematext :: http://blog.sematext.com/ :: Solr - Lucene - Hadoop - HBase
> >
> > On Mon, Jul 9, 2012 at 3:38 PM, Amandeep Khurana <amansk@gmail.com> wrote:
> >
> > > I _think_ you should be able to do it and be just fine, but you'll need to
> > > shut down the region servers before you remove and start them back up after
> > > you are done. Someone else closer to the internals can confirm/deny this.
> > >
> > >
> > > On Monday, July 9, 2012 at 12:36 PM, Alex Baranau wrote:
> > >
> > > > Hello,
> > > >
> > > > I wonder, for purging old data: if I'm OK with the "remove all StoreFiles
> > > > which are older than ..." approach, can I do that? To me it seems like this
> > > > can be a very effective way to remove old data, similar to the fast bulk
> > > > import functionality, but for deletion.
> > > >
> > > > Thank you,
> > > >
> > > > Alex Baranau
> > > > ------
> > > > Sematext :: http://blog.sematext.com/ :: Solr - Lucene - Hadoop - HBase
> > > >
> > > >
> > >
> > >
> > >
> >
>
>
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com
>
