lucene-java-user mailing list archives

From "Erick Erickson" <>
Subject Re: Removing duplicate entries
Date Wed, 30 Apr 2008 15:10:57 GMT
See below:

On Tue, Apr 29, 2008 at 9:51 PM, João Rodrigues <> wrote:

> First of all, let me apologize for the double post but I got some strange
> error message =\
> >The first question is what do you mean the document
> >is already in the index? Lucene doc IDs are useless
> >here since the ones in your FSDir and the ones in your
> >RAMdir are unrelated. In fact, I suspect that the
> >lucene docIDs will start at the same number in both.
> The documents have 2 fields, one of them being their identifier, which is
> Indexed and Not Tokenized (and stored).

Cool, this will work just fine for you. You can forget everything
I said about Lucene doc IDs.

> >Lucene doc IDs are just monotonically incremented integers.
> >So, how do you identify identical documents? Is there some
> >field in your document that's guaranteed to be unique to each
> >document? If so, *that's* the field you can use for termEnum
> >to get the Lucene docid to remove, assuming you've indexed
> >it UN_TOKENIZED or you are very, very, very confident that
> >your tokenizers won't break it up.
> >But you can make this easier by using IndexReader.deleteDocuments(term)
> >where the term is your unique field.
> Since I have a unique field for each document, I'll use that then. However,
> what's the thinking behind the enum/delete? The termEnum lists the terms
> which have a particular field value and gives me their "lucene ID" so that
> the Reader can then delete it?

Probably something very like that, although you see none of it. Just
calling deleteDocuments(term) does it all for you. And I learned long ago
that the folks who write this kind of stuff can probably do it better
than I can <G>.
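
To make the "look up the term, then delete what it points at" idea concrete,
here is a minimal sketch in plain Java. It is a toy model, not Lucene code:
the class ToyTermIndex and its methods are hypothetical names invented for
illustration, and the real implementation certainly differs in detail. It
assumes each document has a single untokenized id field, as described above.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only (hypothetical, NOT Lucene): the idea behind delete-by-term. */
final class ToyTermIndex {
    // Exact term ("id:<value>") -> internal doc numbers that contain it.
    private final Map<String, List<Integer>> postings = new HashMap<>();
    // Deletion is modelled as a mark on the internal doc number.
    private final BitSet deleted = new BitSet();
    private int nextDocId = 0;

    /** Adds a document with one untokenized id field; returns its internal number. */
    int add(String idValue) {
        int docId = nextDocId++;
        postings.computeIfAbsent("id:" + idValue, k -> new ArrayList<>()).add(docId);
        return docId;
    }

    /** Marks every live document containing the term as deleted; returns how many. */
    int deleteByTerm(String idValue) {
        List<Integer> docs = postings.getOrDefault("id:" + idValue, Collections.emptyList());
        int count = 0;
        for (int docId : docs) {
            if (!deleted.get(docId)) {
                deleted.set(docId);
                count++;
            }
        }
        return count;
    }

    /** Number of documents that have not been marked deleted. */
    int liveDocs() {
        return nextDocId - deleted.cardinality();
    }
}
```

The caller only ever supplies the unique field value; the internal doc
numbers stay hidden, which is why the unrelated numbering in the two
directories stops mattering.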

> >Additionally, I question why you bother with a RAMdir for your changes.
> >An index reader essentially takes a snapshot of your index, and
> >subsequent changes are not seen by your searchers until you
> >close and reopen the underlying reader. What advantage do you
> >see in using a RAMdir?
> I use a RAMdir because it is somehow faster. I'm downloading a few thousand
> documents at a time from the internet and then indexing them in a RAM index
> and then merging them to the FS dir. Also, in case anything fails, my FS
> directory, and thus my "stable" index, is safe from harm :) But why did you
> ask? Advice is welcome :)

I ask because a lot of people mistakenly suppose that RAMdirs are faster
because they have RAM in front <G>. The indexing process uses RAM implicitly
and periodically flushes to the FS. You can control this with various
IndexWriter parameters such as setMaxMergeDocs, setMergeFactor etc., and I
personally prefer letting that do the work and avoiding having to code
merging the two.
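
As a rough picture of "uses RAM implicitly and periodically flushes", here
is a small plain-Java sketch. Again this is a toy model and not Lucene code:
ToyBufferedWriter, its maxBufferedDocs threshold, and the in-memory list
standing in for files on disk are all assumptions made for illustration.
(From memory, the corresponding IndexWriter knob is setMaxBufferedDocs, but
check the javadoc for the release you run.)

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only (hypothetical, NOT Lucene): buffer in RAM, flush in batches. */
final class ToyBufferedWriter {
    private final int maxBufferedDocs;
    private final List<String> ramBuffer = new ArrayList<>();
    // Stands in for segments written to the file system.
    private final List<List<String>> flushedSegments = new ArrayList<>();

    ToyBufferedWriter(int maxBufferedDocs) {
        if (maxBufferedDocs < 1) {
            throw new IllegalArgumentException("maxBufferedDocs must be >= 1");
        }
        this.maxBufferedDocs = maxBufferedDocs;
    }

    /** Buffers the document and flushes once the threshold is reached. */
    void addDocument(String doc) {
        ramBuffer.add(doc);
        if (ramBuffer.size() >= maxBufferedDocs) {
            flush();
        }
    }

    /** Moves whatever is buffered into a new "on disk" segment. */
    void flush() {
        if (ramBuffer.isEmpty()) {
            return;
        }
        flushedSegments.add(new ArrayList<>(ramBuffer));
        ramBuffer.clear();
    }

    /** Flushes the remainder, as closing a writer would. */
    void close() {
        flush();
    }

    int segmentCount() {
        return flushedSegments.size();
    }

    int bufferedCount() {
        return ramBuffer.size();
    }
}
```

The point of the sketch is that the batching already happens inside the
writer, so a separate RAM index plus a hand-coded merge mostly duplicates it.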

But your point about failure is valid, although you might want to search the
mail archive for discussions on this as there's been work done to make this
less of a concern....


> Thanks,
> João Rodrigues
