lucene-java-user mailing list archives

From Michael McCandless
Subject Re: How to not overwrite a Document if it 'already exists'?
Date Wed, 06 May 2009 00:38:54 GMT
On Tue, May 5, 2009 at 7:24 PM, Antony Bowesman wrote:
> Michael McCandless wrote:
>> Lucene doesn't provide any way to do this, except opening a reader.
>> Opening a reader is not "that" expensive if you use it for this
>> purpose.  E.g., neither norms nor FieldCache will be loaded if you just
>> enumerate the term docs.
> Thanks for that info.  These indexes will be large, in the tens of millions.
> The id field is unique and is 29 bytes.  I guess that's still a lot of data to
> trawl through to get to the term.

Have you tested how long it takes to look up docs from your id?
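
One way to answer that question is to time a batch of lookups directly. The
sketch below is only a minimal harness, not anything Lucene provides: the
LookupTimer class and its exists predicate are hypothetical names, and the
predicate stands in for whatever check you use against the index (for example,
enumerating the term docs for the id term on an open reader).

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical timing harness; not part of Lucene.
final class LookupTimer {

    /** Outcome of one timing run. */
    static final class Result {
        final int lookups;      // number of ids probed
        final int hits;         // number of ids the predicate reported as present
        final long totalNanos;  // wall-clock time for the whole batch

        Result(int lookups, int hits, long totalNanos) {
            this.lookups = lookups;
            this.hits = hits;
            this.totalNanos = totalNanos;
        }

        /** Average time per lookup in microseconds (0 if nothing was probed). */
        double averageMicros() {
            return lookups == 0 ? 0.0 : totalNanos / 1000.0 / lookups;
        }
    }

    /**
     * Probes every id with the supplied existence check and records how long
     * the batch took. The check is an assumption: plug in your own lookup.
     */
    static Result time(List<String> ids, Predicate<String> exists) {
        int hits = 0;
        long start = System.nanoTime();
        for (String id : ids) {
            if (exists.test(id)) {
                hits++;
            }
        }
        long elapsed = System.nanoTime() - start;
        return new Result(ids.size(), hits, elapsed);
    }
}
```

Running it over a sample of real ids, some present and some absent, after a
warm-up pass should give a rough per-lookup figure to compare against your
indexing rate.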

>> But, you can let Lucene do the same thing for you by just always using
>> updateDocument, which'll remove the old doc if it's present.
> That's precisely what I don't want to occur.  I have two forms of a
> Document, which represent mail items.  One 'full' version containing all
> index and stored data, which represents a searchable mail item and one
> 'base', which is simply a marker Document which represents a mail in a
> forwarded mail chain, with just a couple of stored fields containing the
> mail meta data.
> Under normal circumstances there are no problems as mails arrive in sequence
> and are never handled twice, but there is one case, during a reindex op,
> when the arrival of those mails can come out of sequence, i.e. a full mail
> is indexed first, but that mail is later processed as part of a forwarded
> mail chain of another mail.
> It is the second time that mail is handled as a base mail that I do not want
> it to overwrite the full version.
> Would it be technically difficult to support something like this in the
> IndexWriter API and, if not, would it end up being more efficient than using
> a reader/terms to check this?

Couldn't you just give the base & full docs different ids?  Then you
can independently choose which one to update?
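
A rough sketch of that idea, under stated assumptions: the "full:" and "base:"
prefixes, the MailIdScheme class, and its methods are hypothetical, and a plain
Map stands in for the index so the example runs on its own. Map.put replaces
any existing entry with the same key, which mirrors the remove-then-add
behaviour of updateDocument on a unique id term.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the "different ids" scheme; the Map is a stand-in
// for a Lucene index keyed by a unique id term.
final class MailIdScheme {

    enum Form { FULL, BASE }

    /** Derives the id stored in the index from the mail's own id and its form. */
    static String indexId(Form form, String mailId) {
        return (form == Form.FULL ? "full:" : "base:") + mailId;
    }

    private final Map<String, String> index = new LinkedHashMap<>();

    /** Replaces only the document of the same form, like an update by id term. */
    void update(Form form, String mailId, String content) {
        index.put(indexId(form, mailId), content);
    }

    /** Returns the stored content for that form, or null if none exists. */
    String get(Form form, String mailId) {
        return index.get(indexId(form, mailId));
    }

    int size() {
        return index.size();
    }
}
```

With this scheme, handling a mail a second time as a base mail only replaces
the earlier base entry; the full entry for the same mail is never touched, so
no existence check is needed before updating.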

