lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Realtime Search
Date Sat, 24 Jan 2009 12:29:06 GMT
Jason Rutherglen wrote:

> > "But I think for realtime we don't want to be using IW's deletion at
> all.  We should do all deletes via the IndexReader.  In fact if IW has
> handed out a reader (via getReader()) and that reader (or a reopened
> derivative) remains open we may have to block deletions via IW.  Not
> sure..."
>
> Can't IW use the IR to do its deletions?  Currently deletions in IW  
> are implemented in DocumentsWriter.applyDeletes by loading a segment  
> with SegmentReader.get() and making the deletions which causes term  
> index load overhead per flush.  If IW has an internal IR then the  
> deletion process can use it (not SegmentReader.get) and there should  
> not be a conflict anymore between the IR and IW deletion processes.

Today, IW quickly opens each SegmentReader, applies deletes, then
commits & closes it, because we have considered it too costly to leave
these readers open.

But if you've opened a persistent IR via the IndexWriter anyway, we
should use the SegmentReaders from that IR instead.

It seems like the joint IR+IW would allow you to do adds, deletes,
setNorms, all of which are not visible in the exposed IR until
IR.reopen is called.  reopen would then flush any added docs to new
segments, materialize any buffered deletes into the BitVectors (or
future transactional sorted int tree thingy), likewise for norms, and
then return a new IR.

Ie, the IR becomes transactional as well -- deletes do not become
visible until reopen is called (unlike today when you delete via
IR).  I think this means, internally when IW wants to make changes to
the shared IR, it should make a clone() and do the changes privately
to that instance.  Then when reopen is called, we must internally
reopen that clone() such that its deleted docs are carried over to the
newly reopened reader and newly flushed docs from IW are visible as
new SegmentReaders.
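
A minimal sketch of the reopen semantics described above, written as a toy model. ToyWriter, ToyReader and their methods are hypothetical names for illustration and are not Lucene's API; documents are plain strings and deletions are a BitSet. It only shows the visibility rule: changes accumulate in a private working copy and reach readers only through reopen().

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;

/** Toy model only: none of these classes are Lucene's API. */
class TransactionalReopenSketch {

    /** Immutable point-in-time view: documents plus a deleted-docs bit set. */
    static final class ToyReader {
        private final List<String> docs;
        private final BitSet deleted;

        ToyReader(List<String> docs, BitSet deleted) {
            // Copy both so later writer changes cannot leak into this reader.
            this.docs = Collections.unmodifiableList(new ArrayList<>(docs));
            this.deleted = (BitSet) deleted.clone();
        }

        boolean isLive(String doc) {
            int i = docs.indexOf(doc);
            return i >= 0 && !deleted.get(i);
        }
    }

    /** Writer that changes a private working copy and publishes it on reopen(). */
    static final class ToyWriter {
        private final List<String> docs = new ArrayList<>();  // added docs
        private final BitSet pendingDeleted = new BitSet();   // private deletions

        void addDocument(String doc) { docs.add(doc); }

        void deleteDocument(String doc) {
            int i = docs.indexOf(doc);
            if (i >= 0) pendingDeleted.set(i);
        }

        /** Returns a new reader that carries over all deletes and adds so far. */
        ToyReader reopen() { return new ToyReader(docs, pendingDeleted); }
    }

    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter();
        writer.addDocument("a");
        writer.addDocument("b");
        ToyReader r1 = writer.reopen();

        writer.deleteDocument("a");
        writer.addDocument("c");

        System.out.println(r1.isLive("a")); // true: r1 predates the delete
        ToyReader r2 = writer.reopen();
        System.out.println(r2.isLive("a")); // false: delete carried over
        System.out.println(r2.isLive("c")); // true: new doc is visible
    }
}
```

In the real design the working copy would be a clone() of the shared reader rather than a copied list, so per-segment state could be shared between the old and new readers.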

And on reopen, the deletes should not be flushed to the Directory --
they only need to be "moved" into each SegmentReader's deletedDocs.
We'd also need to ensure when a merge kicks off, the SegmentReaders
used by the merging are not newly reopened but also "borrowed" from
the already open IR.  This could actually mean that some deleted docs
get merged away before the deletions ever get flushed to the Directory.
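
To illustrate that last point, here is a toy sketch (hypothetical classes, not Lucene's merge code) in which a merge borrows already-open segment readers whose deletions exist only in RAM. The merged segment simply omits the deleted docs, so those deletions never need to be written out at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Toy model only: none of these classes are Lucene's API. */
class MergeBorrowedReadersSketch {

    /** Stand-in for an open segment reader with RAM-only deletions. */
    static final class ToySegmentReader {
        final List<String> docs;
        final BitSet deletedDocs = new BitSet(); // never flushed in this sketch

        ToySegmentReader(List<String> docs) {
            this.docs = new ArrayList<>(docs);
        }
    }

    /** Merges the borrowed readers, copying only docs not marked deleted. */
    static ToySegmentReader merge(List<ToySegmentReader> borrowed) {
        List<String> live = new ArrayList<>();
        for (ToySegmentReader reader : borrowed) {
            for (int i = 0; i < reader.docs.size(); i++) {
                if (!reader.deletedDocs.get(i)) {
                    live.add(reader.docs.get(i));
                }
            }
        }
        return new ToySegmentReader(live);
    }

    public static void main(String[] args) {
        ToySegmentReader s1 = new ToySegmentReader(Arrays.asList("a", "b"));
        ToySegmentReader s2 = new ToySegmentReader(Arrays.asList("c", "d"));
        s1.deletedDocs.set(1); // delete "b" in RAM only

        ToySegmentReader merged = merge(Arrays.asList(s1, s2));
        System.out.println(merged.docs); // "b" is gone without any flush
    }
}
```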

> > "we may have to block deletions via IW"
>
> Hopefully they can be buffered.
>
> Where else does the write lock need to be coordinated between IR and  
> IW?
>
> > "somehow IW & IR have to "split" the write lock else we may
> need to merge deletions somehow."
>
> This is a part I'd like to settle on before start of  
> implementation.  It looks like in IW deletes are buffered as terms  
> or queries until flushed.  I don't think there needs to be a lock  
> until the flush is performed?
>
> For the merge changes to the index, the deletionpolicy can be used  
> to ensure a reader still has access to the segments it needs from  
> the main directory.

The write lock is held to prevent multiple writers from buffering and
then writing changes to the index.  Since we will have this joint
IR/IW share state, as long as we properly synchronize/share things
between IR/IW, it's fine if they both "share" the write lock.

It seems like IR.reopen suddenly means "have IW materialize all
pending stuff and give me a new reader", where stuff is adds &
deletes.  Adds must materialize via the directory.  Deletes can
materialize entirely in RAM.  Likewise for norms.

When IW.commit is called, it also then asks each SegmentReader to
commit.  Ie, IR.commit would not be used.

> > "We have to test performance to measure the net add -> search  
> latency.
> For many apps this approach may be plenty fast.  If your IO system is
> an SSD it could be extremely fast.  Swapping in RAMDir
> just makes it faster w/o changing the basic approach."
>
> It is true that this is best way to start and in fact may be good  
> enough for many users.  It could help new users to expose a reader  
> from IW so the delineation between them is removed and Lucene  
> becomes easier to use.
>
> At the very least this system allows concurrently updateable IR and  
> IW due to sharing the write lock, something that is currently  
> incorrect in Lucene.

I wouldn't call it "incorrect".  It was an explicit design tradeoff to
make the division between IR & IW, and done for many good reasons.  We
are now talking about relaxing that and it clearly raises a number of
"challenging" issues...

> > "Besides the transaction log (for crash recovery), which should fit
> "above" Lucene nicely, what else is needed for realtime beyond the
> single-transaction support Lucene already provides?"
>
> What we have described above (exposing IR via IW) will be sufficient  
> and realtime will live above it.

OK, good.

In this model, the combined IR+IW is still jointly transactional, in
that the IW's commit() method still behaves as it does today.  It's just
that the IR that's linked to the IW is allowed to "see" changes, shared
only in RAM, that a freshly opened IR on the index would not see until
commit has been called.
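
The difference between the linked reader and a freshly opened one can be sketched with another toy model. All names here are hypothetical and not Lucene's API. As a simplification, the "uncommitted" list lumps together everything not yet committed, whether it lives in RAM or in flushed-but-uncommitted segments, and the "committed" list stands in for what a fresh reader finds in the Directory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model only: none of these classes are Lucene's API. */
class JointVisibilitySketch {

    static final class ToyWriter {
        // What a freshly opened reader would find (last commit point).
        private final List<String> committed = new ArrayList<>();
        // Current state including changes not yet committed.
        private final List<String> uncommitted = new ArrayList<>();

        void addDocument(String doc) { uncommitted.add(doc); }

        void deleteDocument(String doc) { uncommitted.remove(doc); }

        /** Linked reader: snapshot that includes uncommitted changes. */
        List<String> reopenLinked() {
            return Collections.unmodifiableList(new ArrayList<>(uncommitted));
        }

        /** Fresh reader on the index: sees only the last commit. */
        List<String> openFresh() {
            return Collections.unmodifiableList(new ArrayList<>(committed));
        }

        /** Commit behaves as before: it makes the current state durable. */
        void commit() {
            committed.clear();
            committed.addAll(uncommitted);
        }
    }

    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter();
        writer.addDocument("a");

        System.out.println(writer.reopenLinked()); // linked reader sees "a"
        System.out.println(writer.openFresh());    // fresh reader sees nothing yet

        writer.commit();
        System.out.println(writer.openFresh());    // now "a" is visible to everyone
    }
}
```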

Mike
