lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "João Rodrigues" <>
Subject Re: Removing duplicate entries
Date Wed, 30 Apr 2008 01:51:19 GMT
First of all, let me apologize for the double post but I got some strange
error message =\

>The first question is what do you mean the document
>is already in the index? Lucene doc IDs are useless
>here since the ones in your FSDir and the ones in your
>RAMdir are unrelated. In fact, I suspect that the
>lucene docIDs will start at the same number in both.

The documents have 2 fields, one of them being their identifier, which is
Indexed and Not Tokenized (and stored).

>Lucene doc IDs are just monotonically incremented integers.
>So, how do you identify identical documents? Is there some
>field in your document that's guaranteed to be unique to each
>document? If so, *that's* the field you can use for termEnu
>to get the Lucene docid to remove, assuming you've indexed
>it UN_TOKENIZED or you are very, very, very confident that
>your tokenizers won't break it up.
>But you can make this easier by using IndexReader.deleteDocument(term)
>where the term is your unique field.

Since I have an unique field for each document, I'll use that then. However,
what's the thinking behind the enum/delete? The termEnum lists me the terms
which have a particular field value and gives me their "lucene ID" so that
the Reader can then delete it?

>Additionally, I question why you bother with a RAMdir for your changes.
>An index reader essentially takes a snapshot of your index, and
>subsequent changes are not seen by your searchers until you
>close and reopen the underlying reader. What advantage do you
>see in using a RAMdir?

I use a RAMdir because it is somehow faster. I'm downloading a few thousand
documents at a time from the internet and then indexing them in a RAM index
and then merging them to the FS dir. Also, in case anything fails, my FS
directory, and thus my "stable" index, is safe from harm :) But why did you
ask? Advice is welcome :)


João Rodrigues
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message