lucene-dev mailing list archives

From "Yonik Seeley" <>
Subject Re: Unique doc ids
Date Thu, 24 Jan 2008 13:30:08 GMT
On Jan 24, 2008 5:47 AM, Michael McCandless <> wrote:
> Yonik Seeley wrote:
> > On Jan 23, 2008 6:34 AM, Michael McCandless
> > <> wrote:
> >>    writer.freezeDocIDs();
> >>    try {
> >>      get docIDs from somewhere & call writer.deleteByDocID
> >>    } finally {
> >>      writer.unfreezeDocIDs();
> >>    }
> >
> > Interesting idea, but would require the IndexWriter to flush the
> > buffered docs so an IndexReader could be created from them.  (or would
> > require the existence of an UnflushedDocumentsIndexReader)
> True.
> Actually, an UnflushedDocumentsIndexReader would not be hard!
> DocumentsWriter already has an IndexInput (ByteSliceReader) that can
> read the postings for a single term from the RAM buffer (this is used
> when flushing the segment).  I think it'd be straightforward to get
> TermEnum/TermDocs/TermPositions iterators on the buffered docs.
> Norms are already stored as byte arrays in memory.  FieldInfos is
> already available.  The stored fields & term vectors are already
> flushed to the directory so they could be read normally.
> Hmm, buffered delete terms are tricky.  I guess freezeDocIDs would
> have to flush deleted terms (and queries, if we add that) before
> making a reader accessible,

If we buffer queries, that would seem to take care of 99% of the
use cases that need an IndexReader, right?   A custom query could get
ids from an index however it wanted.

> though, the cost is shared because the
> readers need to be opened anyway (so the app can find docIDs).
> So maybe this approach becomes this:
>    // Returns a "point in time" frozen view of index...
>    IndexReader reader = writer.getReader();
>    try {
>      <get docIDs from reader, delete by docID>
>    } finally {
>      writer.releaseReader();
>    }
> ?
> We may even be able to implement this w/o actually freezing the
> writer, i.e., still allowing add/updateDocument calls to proceed.
> Merging could certainly still proceed.  This way you could at any
> time ask a writer for a "point in time" reader, independent of what
> else you are doing with the writer.  This would require, on flushing,
> that writer goes and swaps in a "real" segment reader, limited to a
> specified docID, for any point in time readers that are open.

Wow... sounds complex.
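
For illustration only, here is a much-simplified sketch of the "limited to a specified docID" part of that idea. It assumes an append-only, in-memory document list; GrowingIndex and PointInTimeView are hypothetical names, not Lucene classes, and the hard part (swapping in a real segment reader on flush) is not modeled at all.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical append-only index; stands in for a writer that keeps adding docs. */
final class GrowingIndex {
    private final List<String> docs = new ArrayList<>();

    /** Adds a document and returns the docID assigned to it. */
    synchronized int addDocument(String doc) {
        docs.add(doc);
        return docs.size() - 1;
    }

    synchronized int maxDoc() {
        return docs.size();
    }

    synchronized String document(int docId) {
        return docs.get(docId);
    }

    /** Returns a view frozen at the current maxDoc(); later adds stay invisible to it. */
    PointInTimeView snapshot() {
        return new PointInTimeView(this, maxDoc());
    }
}

/** Hypothetical "point in time" view: only docIDs below the captured bound are visible. */
final class PointInTimeView {
    private final GrowingIndex index;
    private final int maxDoc;

    PointInTimeView(GrowingIndex index, int maxDoc) {
        this.index = index;
        this.maxDoc = maxDoc;
    }

    int maxDoc() {
        return maxDoc;
    }

    String document(int docId) {
        if (docId < 0 || docId >= maxDoc) {
            throw new IndexOutOfBoundsException("docId " + docId + " is not visible in this view");
        }
        return index.document(docId);
    }
}
```

The view stays stable here only because existing docIDs never change; once segments can be flushed or merged underneath, the bound alone is no longer enough.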

> >> If we went that route, we'd need to expose methods in IndexWriter to
> >> let you get reader(s), and, to then delete by docID.
> >
> > Right... I had envisioned a callback that was called after a new
> > segment was created/flushed that passed IndexReader[].  In an
> > environment of mixed deletes and adds, it would avoid slowing down the
> > indexing part by limiting where the deletes happen.
> This would certainly be less work :)  I guess the question is how
> severely are we limiting the application by requiring that you can
> only do deletes when IW decides to flush, or, by forcing the
> application to flush when it wants to do deletes.

Seems like more work, rather than limiting... "when" really isn't as
important as long as it's before a new external IndexReader is opened
for searching.
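
A toy sketch of the callback shape I have in mind, to make the discussion concrete. Everything here is hypothetical: ToySegment stands in for the IndexReader[] that would be passed, and FlushCallback / ToyWriter are made-up names, not an existing IndexWriter API.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical stand-in for a flushed segment; not Lucene's IndexReader. */
final class ToySegment {
    final List<String> docs;
    final BitSet deleted = new BitSet();

    ToySegment(List<String> docs) {
        this.docs = docs;
    }
}

/** Hypothetical callback, invoked after each flush with all segments flushed so far. */
interface FlushCallback {
    void afterFlush(List<ToySegment> segments);
}

/** Toy writer: buffers documents in memory and flushes every maxBuffered adds. */
final class ToyWriter {
    private final int maxBuffered;
    private final FlushCallback callback;
    private final List<String> buffer = new ArrayList<>();
    private final List<ToySegment> segments = new ArrayList<>();

    ToyWriter(int maxBuffered, FlushCallback callback) {
        this.maxBuffered = maxBuffered;
        this.callback = callback;
    }

    void addDocument(String doc) {
        buffer.add(doc);
        if (buffer.size() >= maxBuffered) {
            flush();
        }
    }

    /** Turns the buffered docs into a new segment, then lets the app apply its deletes. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        segments.add(new ToySegment(new ArrayList<>(buffer)));
        buffer.clear();
        callback.afterFlush(segments);
    }

    List<ToySegment> segments() {
        return segments;
    }
}
```

Deletes happen only inside afterFlush, so addDocument never pays for them; the application just has to make sure a flush has happened before it opens a new reader for searching.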

> > It does put a little more burden on the user, but a slightly harder
> > (but more powerful / more efficient) API is preferable since easier
> > APIs can always be built on top (but not vice-versa).
> True, though emulating the easier API on top of the "you get to
> delete only when IW flushes" means you are forcing a flush, right?

I was thinking via buffering (the same way term deletes are handled now).
You keep track of maxDoc() at the time of the delete and defer it until later.
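
Roughly like the sketch below. This is an illustration under assumptions, not existing code: BufferedDocIdDeletes is a hypothetical class, a BitSet stands in for the deleted-docs structure, and how the recorded maxDoc would be used once merges renumber documents is left open.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical buffer of delete-by-docID requests; not part of Lucene's API. */
final class BufferedDocIdDeletes {

    /** One deferred delete plus the maxDoc() observed when it was requested. */
    static final class Pending {
        final int docId;
        final int maxDocAtDelete;

        Pending(int docId, int maxDocAtDelete) {
            this.docId = docId;
            this.maxDocAtDelete = maxDocAtDelete;
        }
    }

    private final List<Pending> pending = new ArrayList<>();

    /** Records the delete without applying it; the docID must exist at request time. */
    void deleteByDocId(int docId, int currentMaxDoc) {
        if (docId < 0 || docId >= currentMaxDoc) {
            throw new IllegalArgumentException(
                "docId " + docId + " out of range for maxDoc " + currentMaxDoc);
        }
        pending.add(new Pending(docId, currentMaxDoc));
    }

    int pendingCount() {
        return pending.size();
    }

    /**
     * Applies all buffered deletes (for example at flush time) and clears the buffer.
     * Returns how many documents were newly marked deleted.
     */
    int applyAll(BitSet deleted) {
        int applied = 0;
        for (Pending p : pending) {
            if (!deleted.get(p.docId)) {
                deleted.set(p.docId);
                applied++;
            }
        }
        pending.clear();
        return applied;
    }
}
```

The easy API then just calls deleteByDocId with the current maxDoc() and returns immediately; nothing forces a flush, and applyAll runs whenever the writer flushes anyway.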

