lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <>
Subject Re: Removing duplicate entries
Date Tue, 29 Apr 2008 20:15:06 GMT
The first question is what do you mean the document
is already in the index? Lucene doc IDs are useless
here since the ones in your FSDir and the ones in your
RAMdir are unrelated. In fact, I suspect that the
lucene docIDs will start at the same number in both.

Lucene doc IDs are just monotonically incremented integers.

So, how do you identify identical documents? Is there some
field in your document that's guaranteed to be unique to each
document? If so, *that's* the field you can use for termEnum
to get the Lucene docid to remove, assuming you've indexed
it UN_TOKENIZED or you are very, very, very confident that
your tokenizers won't break it up.

But you can make this easier by using IndexReader.deleteDocument(term)
where the term is your unique field.

Additionally, I question why you bother with a RAMdir for your changes.
An index reader essentially takes a snapshot of your index, and
subsequent changes are not seen by your searchers until you
close and reopen the underlying reader. What advantage do you
see in using a RAMdir?


On Tue, Apr 29, 2008 at 1:57 PM, João Rodrigues <> wrote:

> Hello all. Before I ask my question, I'd like to clarify I've read the
> manual and searched the archives, and if I'm here, it is because I've
> neither found a suitable answer, or (most likely) I didn't understand
> those
> which I did find :)
> I have an index built, which I update regularly. However, there may be
> some
> entries, at the time of the update, which are already in the index. As
> such,
> I'd like to remove such entries. I've read about some IndexReader.termEnum
> stuff, but I can't make anything of it =\ I can (and will) rebuild my
> index,
> so I'm looking at a "runtime" approach.
> Here's what I'd like to do: I'm indexing the new documents to a RAM Index.
> Then, I merge it with the FS one, already on the hard drive and containing
> all the old documents. Is there a way to check if a given document is in
> the
> index, and if it is, replace it with this new one? I know (or at least
> think
> I know) how to do this "replacement", but it's the comparison I'm failing
> to
> grasp.
> Thanks for any answers :)
> --
> João Rodrigues

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message