hbase-user mailing list archives

From Eric Czech <eczec...@gmail.com>
Subject Re: Indexing w/ HBase
Date Fri, 12 Oct 2012 13:34:30 GMT
Hi Michael,

1) The indexes for me specifically are timeseries grouped by different
properties of the raw data.  What tends to change though is what those
"properties" really should be.  For example, we might have a timeseries
that's indexed by a country and zip code that was created from a latitude /
longitude pair in the raw data and then find that the conversion there was
incorrect (i.e. it should have been a different country + zip code for the
latitude + longitude pair).
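
To make that concrete, here is a rough sketch of why a rerun does not simply overwrite the old cells. The row key layout (country|zip|timestamp), the function name, and the lookup contents are all made up for illustration and are not our real schema:

```python
def index_row_key(lat, lon, ts, lookup):
    """Build a hypothetical index row key from a raw record.

    `lookup` stands in for the auxiliary map from a latitude/longitude
    pair to (country, zip code).
    """
    country, zip_code = lookup[(lat, lon)]
    return f"{country}|{zip_code}|{ts}"


# Same raw record, two versions of the auxiliary lookup map (made-up values).
old_lookup = {(40.75, -73.99): ("US", "10002")}  # incorrect conversion
new_lookup = {(40.75, -73.99): ("US", "10001")}  # corrected conversion

old_key = index_row_key(40.75, -73.99, 1350048870, old_lookup)
new_key = index_row_key(40.75, -73.99, 1350048870, new_lookup)

# The keys differ, so re-running the job writes a new cell and leaves
# the stale one in place unless something explicitly retires it.
stale_cell_remains = old_key != new_key
```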

2) I definitely don't want to store the indexes outside of HBase -- I just
want a way to be able to rebuild parts of the HBase index using the same
source HDFS files and different index building "logic" (generally different
auxiliary lookup maps but other things could change too) without making the
current version of that part of the index unavailable.
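
A minimal sketch of the "active version" idea from my original message (quoted below), using plain in-memory dictionaries as a stand-in for HBase. The class and method names are hypothetical, and this is only meant to show the read/write/activate flow, not how it would actually be laid out in HBase tables:

```python
from dataclasses import dataclass, field


@dataclass
class VersionedIndex:
    """In-memory stand-in for an index table plus a small version registry."""

    # (source_file, version) -> {row_key: value}
    cells: dict = field(default_factory=dict)
    # source_file -> the version readers are allowed to see
    active: dict = field(default_factory=dict)

    def write(self, source_file, version, row_key, value):
        """Used by the index build job; invisible to readers until activated."""
        self.cells.setdefault((source_file, version), {})[row_key] = value

    def activate(self, source_files, version):
        """Called once, after a successful build, for all files in that job."""
        for source_file in source_files:
            self.active[source_file] = version

    def read(self, row_key):
        """Return values for row_key from active versions only."""
        return [
            rows[row_key]
            for (source_file, version), rows in self.cells.items()
            if self.active.get(source_file) == version and row_key in rows
        ]


index = VersionedIndex()
index.write("A", 1, "US|10002", "series-1")
index.activate(["A"], 1)

# Rebuild file A with corrected lookup data as version 2.
index.write("A", 2, "US|10001", "series-1")
during_rebuild = index.read("US|10002")   # version 1 is still served

index.activate(["A"], 2)
after_old = index.read("US|10002")        # old cells are no longer visible
after_new = index.read("US|10001")        # new cells are visible
```

The old cells for version 1 are still stored after the switch, so some cleanup of inactive versions would be needed separately.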

Does that answer your questions?

On Fri, Oct 12, 2012 at 9:00 AM, Michael Segel <michael_segel@hotmail.com> wrote:

> Silly question(s).
> 1) What sort of indexes do you want to build?
> 2) Why would you want to store your indexes outside of HBase?
> (Ok they are not so silly.  But I don't want people to think that I'm
> against the idea, just that it's more of an issue of design.)
> -Mike
> On Oct 12, 2012, at 7:03 AM, Eric Czech <eczech52@gmail.com> wrote:
> > Hi everyone,
> >
> > Are there any tools or libraries for managing HDFS files that are used
> > solely for the purpose of creating indexes in HBase?  In other words, is
> > there any way to seamlessly integrate new HDFS files into a periodic
> > MapReduce process that builds indexes and also reprocess those files if the
> > index building logic or underlying HDFS files change?
> >
> > I'm looking for something similar to HCatalog but the limitation I find
> > with it is that there's no way to rebuild parts of an index without
> > deleting the old index entries or having to guarantee that the new index
> > cells will completely overwrite the old ones.
> >
> > Here's an example to better explain:
> >
> > -  Assume I want to build an index in HBase on HDFS files A, B, and C.
> > -  Let's say I build that index with a MapReduce job and then realize that
> > one of the auxiliary lookup files used in that job was not completely
> > correct.
> > -  I'd like to rerun the indexing job at this point but it's entirely
> > possible that the new index won't involve all the same cells as the old
> > index.
> > -  Now, I can't delete all the old index entries before running the new job
> > since that index may still be in use, so there's no obvious way to update
> > the index in isolation.
> >
> > The prevailing approach to solving this seems to be continually rebuilding
> > the indexes in full and having a way to atomically switch the old indexes
> > out with the new ones.  A better approach might be to do the same thing
> > with a higher granularity and what I'm really asking is whether or not
> > there is any tool that does exactly that.
> >
> > A naive approach at "versioning" like this with higher granularity might
> > simply tie HDFS files to cells in HBase, give that association a version
> > number, and allow clients to only read cells from HBase associated with
> > active versions (as opposed to versions that are currently being inserted
> > into HBase).  Then the "active" version could be incremented at the end of
> > a successful MapReduce index build for all files used in that job.
> >
> > If there are no existing tools for something like this, then doing what I
> > mentioned above is probably the route I'll take and I'm very curious to
> > hear if others are facing similar problems and whether or not a tool to
> > solve them would be more widely beneficial.
> >
> > Thank you!
