lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: avoid overwrite in DataImportHandler
Date Thu, 08 Dec 2011 13:35:13 GMT
Overwriting is controlled by Solr itself, via the <uniqueKey> field in your
schema. To turn it off, just remove that entry.

But then it's up to you to handle the fact that a query can return multiple
documents with the same ID. And it won't matter which program adds the
data: *nothing* will be overwritten. DIH has no part in that decision.
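
To make the difference concrete, here is a minimal sketch in Python. It is a
toy model, not Solr code: the ToyIndex class and its methods are made up for
illustration, and it only assumes that adding a document whose key already
exists replaces the older document.

```python
from typing import Optional


class ToyIndex:
    """Hypothetical stand-in for an index; this is not Solr code."""

    def __init__(self, unique_key: Optional[str] = None):
        # Name of the field acting as the unique key, or None if there is none.
        self.unique_key = unique_key
        self.docs = []

    def add(self, doc: dict) -> None:
        if self.unique_key is not None:
            key = doc[self.unique_key]
            # With a unique key, drop any older document carrying the same key.
            self.docs = [d for d in self.docs if d[self.unique_key] != key]
        # Without a unique key, every add is a plain append.
        self.docs.append(doc)

    def query(self, field: str, value: str) -> list:
        return [d for d in self.docs if d.get(field) == value]


with_key = ToyIndex(unique_key="id")
without_key = ToyIndex()
for index in (with_key, without_key):
    index.add({"id": "1", "text": "first version"})
    index.add({"id": "1", "text": "second version"})

print(len(with_key.query("id", "1")))     # one document survives
print(len(without_key.query("id", "1")))  # both documents come back
```

In the second case the caller has to decide which of the returned documents
is the one it wants.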

Deduplication is about defining some fields in your record and avoiding
adding another document if the contents are "close", where close is a
slippery concept. I don't think it's related to your problem at all.
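
As a rough sketch of the idea, again in Python and again not Solr's actual
implementation: build a signature from a chosen set of fields and skip any
document whose signature has been seen before. The function names and the
field names ("title", "body") are assumptions for illustration. An exact hash
is used here, so "close" means "identical in those fields"; a fuzzy signature
would take the place of the hash function.

```python
import hashlib


def signature(doc: dict, fields: list) -> str:
    """Hash the chosen fields of a document into one hex string."""
    h = hashlib.md5()
    for name in fields:
        h.update(str(doc.get(name, "")).encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()


def add_unless_duplicate(index: list, seen: set, doc: dict, fields: list) -> bool:
    """Append doc to index unless a document with the same signature exists."""
    sig = signature(doc, fields)
    if sig in seen:
        return False  # same content in the chosen fields: skip it
    seen.add(sig)
    index.append(doc)
    return True


index, seen = [], set()
fields = ["title", "body"]  # hypothetical signature fields

first = add_unless_duplicate(
    index, seen, {"id": "1", "title": "Report", "body": "Same text"}, fields)
second = add_unless_duplicate(
    index, seen, {"id": "2", "title": "Report", "body": "Same text"}, fields)
third = add_unless_duplicate(
    index, seen, {"id": "3", "title": "Report", "body": "Other text"}, fields)

print(first, second, third, len(index))
```

Note that the ID plays no part in the comparison, which is why this is a
different question from whether a document with an existing ID gets replaced.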

Best
Erick

On Wed, Dec 7, 2011 at 3:27 PM, P Williams
<williams.tricia.list@gmail.com> wrote:
> Hi,
>
> I've wondered the same thing myself.  I feel like the "clean" parameter has
> something to do with it but it doesn't work as I'd expect either.  Thanks
> in advance to anyone who can answer this question.
>
> *clean* : (default 'true'). Tells whether to clean up the index before the
> indexing is started.
>
> Tricia
>
> On Wed, Dec 7, 2011 at 12:49 PM, sabman <saby83@gmail.com> wrote:
>
>> I have a unique ID defined for the documents I am indexing. I want to avoid
>> overwriting the documents that have already been indexed. I am using
>> XPathEntityProcessor and TikaEntityProcessor to process the documents.
>>
>> The DataImportHandler does not seem to have the option to set
>> overwrite=false. I have read some other forums to use deduplication instead
>> but I don't see how it is related to my problem.
>>
>> Any help on this (or an explanation of how deduplication would apply to my
>> problem) would be great. Thanks!
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/avoid-overwrite-in-DataImportHandler-tp3568435p3568435.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
