hbase-user mailing list archives

From Jason Rutherglen <jason.rutherg...@gmail.com>
Subject Re: HBase and Lucene for realtime search
Date Sun, 13 Feb 2011 23:37:29 GMT
> Google's percolator paper.

Can you post a link?

> Another issue is that maybe the scalability needs for search might be
> different. An HBase region is always only active in one region server, there
> are no active replica's, while often for search you need replicas to scale,
> since a search will typically hit all partitions.

Really?  That seems odd.

> I assume you don't really need ACID transactions, but only the guarantee
> that when you update an HBase row, its index will eventually be updated too?
> (possibly with a little "RT" delay).

While not "needed", it's definitely a worthy goal.  E.g., with the newer
RT functionality in Lucene this'll more or less be available out of
the box, hopefully with no delay.

> If it fails anywhere in between, one can always replay from the WAL. If you
> add a write-ahead-log just to e.g. Katta, that won't help yet with the
> consistency across the systems

Right, I think this is a real problem.  My guess is it'll be easier to
develop a scalable RT search system around HBase, and then separate it
out if that's possible/needed.
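
For reference, the external write-ahead-log sequence quoted above (write to
the WAL, then update HBase, then update Lucene, and replay from the WAL after
a failure) can be sketched roughly as follows. This is only an illustration:
the two stores are plain dicts, the WAL is an in-memory list, and the names
ExternalWal, apply_update, replay and fail_before are all hypothetical, not
HBase, Lucene, Katta or Lily APIs.

```python
class ExternalWal:
    """In-memory stand-in for a durable log kept outside both stores."""

    def __init__(self):
        self.entries = []   # (seq, row_key, value), where seq == list index
        self.applied = {}   # store name -> highest seq applied to that store

    def append(self, row_key, value):
        seq = len(self.entries)
        self.entries.append((seq, row_key, value))
        return seq


def apply_update(wal, stores, row_key, value, fail_before=None):
    """Log first, then update each store in order (HBase, then Lucene).

    fail_before is a hypothetical test hook that simulates a crash just
    before the named store is updated.
    """
    seq = wal.append(row_key, value)          # (1) write update to the WAL
    for name, store in stores:                # (2) HBase, (3) Lucene
        if name == fail_before:
            raise RuntimeError("simulated crash before updating " + name)
        store[row_key] = value
        wal.applied[name] = seq


def replay(wal, stores):
    """Re-apply every logged update a store has not yet seen."""
    for name, store in stores:
        start = wal.applied.get(name, -1) + 1
        for seq, row_key, value in wal.entries[start:]:
            store[row_key] = value
            wal.applied[name] = seq
```

Replay is safe here only because each update is an idempotent overwrite of a
row; a real system would also need the log itself to be durable.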

> to be the main action and all what follows just secondary side-effects (i.e.
> there's no rollback).

I think inside a Coprocessor you could block the HBase 'commit' until
a successful updateDoc call to Lucene (which is only an update to RAM
anyways)?
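
A toy sketch of that idea, assuming a hook that runs before the row is
committed and aborts the write if it raises. IndexedRegion and index_update
are made-up names for illustration; this is not the HBase Coprocessor API,
and the dict merely stands in for Lucene's in-RAM update.

```python
class IndexedRegion:
    """Toy region whose put() commits only after the index hook succeeds."""

    def __init__(self, index_update):
        self.rows = {}                    # stand-in for the committed rows
        self.index_update = index_update  # stand-in for the Lucene update

    def put(self, row_key, value):
        # Run the index update first; if it raises, the exception
        # propagates and the row below is never committed.
        self.index_update(row_key, value)
        self.rows[row_key] = value


index = {}
region = IndexedRegion(index.__setitem__)
region.put("row1", "some text")
```

Note the sketch only covers a failing index update; if the index update
succeeds and the commit itself then fails, the index would be ahead of the
store.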

> That would definitely be interesting, but I guess for it to work with good
> performance the ordering of the HBase row keys should be the same as that of
> the Lucene doc IDs

That'd be ideal, and/or being able to write the HBase key value file
pointer into Lucene, though that seems a little far fetched.
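
To illustrate why the ordering matters: if doc IDs followed row key order, a
region split would map to cutting every posting list at one doc ID. A minimal
sketch, assuming postings are sorted lists of integer doc IDs held in a dict
(split_postings is a hypothetical helper, not a Lucene call):

```python
from bisect import bisect_left


def split_postings(postings, split_doc_id):
    """Split each sorted posting list at split_doc_id.

    Doc IDs below split_doc_id go to the lower index unchanged; the rest
    go to the upper index, renumbered so the upper index starts at 0.
    """
    lower, upper = {}, {}
    for term, doc_ids in postings.items():
        cut = bisect_left(doc_ids, split_doc_id)
        if doc_ids[:cut]:
            lower[term] = doc_ids[:cut]
        if doc_ids[cut:]:
            upper[term] = [d - split_doc_id for d in doc_ids[cut:]]
    return lower, upper


postings = {"hbase": [0, 2, 5], "lucene": [1, 5, 6]}
lower, upper = split_postings(postings, 4)
```

Without that ordering, the documents of one region would be scattered across
each posting list and every list would have to be filtered and renumbered.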

On Sun, Feb 13, 2011 at 5:13 AM, Bruno Dumon <bruno@outerthought.org> wrote:
> On Sat, Feb 12, 2011 at 10:31 PM, Jason Rutherglen <
> jason.rutherglen@gmail.com> wrote:
>
>> Right, the concepts aren't that hard (write ahead log etc), however to
>> keep the data transactionally consistent with another datastore across
>> servers [I believe] is a little more difficult?
>
>
> I assume you don't really need ACID transactions, but only the guarantee
> that when you update an HBase row, its index will eventually be updated too?
> (possibly with a little "RT" delay).
>
> [As you probably know, ] the basic solution to do this across systems is a
> write-ahead-log outside of these systems, i.e. the sequence to perform an
> update would be:
>  (1) write update to the WAL
>  (2) perform update on HBase
>  (3) perform update on Lucene
>
> If it fails anywhere in between, one can always replay from the WAL. If you
> add a write-ahead-log just to e.g. Katta, that won't help yet with the
> consistency across the systems, as it could fail between doing the update to
> HBase and writing to the Katta-WAL.
>
> We do have something like this in Lily (http://lilyproject.org, check the
> 'rowlog' thing), though it is somewhat different than above; to the "WAL" we
> only write the ID of the row, since we consider the update to the HBase row
> to be the main action and all what follows just secondary side-effects (i.e.
> there's no rollback).
>
> Slightly similar ideas can be found in Google's percolator paper.
>
>
>>  Also with RT there
>> needs to be a primary data store somewhere outside of Lucene,
>> otherwise we'd be storing the same data twice, eg, in HBase and
>> Lucene, that's inefficient.  I'm guessing it'll be easier to keep
>> Lucene indexes in parallel with HBase regions across servers, and then
>> use the Coprocessor architecture etc, to keep them in sync, on the
>> same server.  When a region is split, we'd need to also split the
>> Lucene index, this'd be the only 'new' technology that'd need to be
>> created on the Lucene side.
>>
>
> That would definitely be interesting, but I guess for it to work with good
> performance the ordering of the HBase row keys should be the same as that of
> the Lucene doc IDs (so that posting lists can be split in the middle rather
> than having to rearrange everything), and I don't see how that could be the
> case.
>
> Another issue is that maybe the scalability needs for search might be
> different. An HBase region is always only active in one region server, there
> are no active replica's, while often for search you need replicas to scale,
> since a search will typically hit all partitions.
>
> --
> Bruno Dumon
> Outerthought
> http://outerthought.org/
>
