lucene-solr-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Question about http://wiki.apache.org/solr/Deduplication
Date Sat, 02 Apr 2011 23:05:05 GMT

: Is it possible in solr to have multivalued "id"? Or I need to make my
: own "mv_ID" for this? Any ideas how to achieve this efficiently?

This isn't something the SignatureUpdateProcessor is going to be able to 
help you with -- it does the deduplication by changing the low level 
"update" (implemented as a delete then add) so that the key used to delete 
the older documents is based on the signature field instead of the id 
field.
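
To make the delete-then-add behaviour concrete, here is a toy in-memory model. It is not Solr code: `ToyIndex`, `signature()` and the choice of MD5 over the "title" and "body" fields are all assumptions made up for illustration. It only shows why a later document with the same signature replaces the earlier one, so the earlier id is lost.

```python
import hashlib


def signature(doc, fields):
    """Hash the chosen fields into a hex digest (a stand-in for the signature field)."""
    joined = "\x1f".join(str(doc.get(f, "")) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


class ToyIndex:
    """Hypothetical index whose update key is the signature, not the id."""

    def __init__(self, sig_fields):
        self.sig_fields = sig_fields
        self.docs = {}  # signature -> stored document

    def update(self, doc):
        sig = signature(doc, self.sig_fields)
        stored = dict(doc, signature=sig)
        self.docs.pop(sig, None)   # "delete": drop any older doc with this signature
        self.docs[sig] = stored    # "add": store the new doc
        return sig


index = ToyIndex(["title", "body"])
index.update({"id": "a", "title": "Hello", "body": "World"})
index.update({"id": "b", "title": "Hello", "body": "World"})

# Only the second document survives; id "a" is gone.
surviving_ids = [d["id"] for d in index.docs.values()]
```

Because the older document is removed outright, there is nowhere for its id to accumulate, which is why a multivalued id cannot fall out of this mechanism on its own.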

In order to do what you are describing, you would need to query the index 
for matching signatures, then add the resulting ids to your document 
before doing that "update".

You could possibly do this in a custom UpdateProcessor, but you'd have to 
do something tricky to ensure you didn't overlook docs that had been added 
but not yet committed when checking for dups.
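
A rough sketch of the bookkeeping this implies, again as a toy model rather than anything from the Solr API: `ToyDedupBuffer` and its `committed` / `pending` maps are hypothetical names. The point is only that the duplicate check has to consult both what a search can already see and what has been added since the last commit.

```python
class ToyDedupBuffer:
    """Hypothetical tracker of ids per signature, split by commit state."""

    def __init__(self):
        self.committed = {}  # signature -> ids visible to searches
        self.pending = {}    # signature -> ids added since the last commit

    def known_ids(self, sig):
        # Check both maps; looking at `committed` alone would miss recent adds.
        return self.committed.get(sig, []) + self.pending.get(sig, [])

    def add(self, doc_id, sig):
        """Record a new doc and return the ids of earlier docs with the same signature."""
        dup_ids = self.known_ids(sig)
        self.pending.setdefault(sig, []).append(doc_id)
        return dup_ids

    def commit(self):
        # Move everything pending into the committed view.
        for sig, ids in self.pending.items():
            self.committed.setdefault(sig, []).extend(ids)
        self.pending.clear()


buf = ToyDedupBuffer()
first = buf.add("a", "sig-1")    # nothing seen yet
second = buf.add("b", "sig-1")   # "a" is found even though it is uncommitted
buf.commit()
third = buf.add("c", "sig-1")    # both earlier ids are found after the commit
```

In a real UpdateProcessor the `committed` side would be an index query and the `pending` side would need to be shared safely across concurrent update requests, which is the tricky part.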

I don't have a good suggestion for how to do this internally in Solr -- it 
seems like the type of bulk processing logic that would be better suited 
for an external process before you ever start indexing (much like link 
analysis for back references).
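
As a minimal sketch of such an external pass, assuming the documents are available as plain dicts before indexing: group them by signature and collect every original id into a multivalued field. The field name "mv_ID" is borrowed from the question; `merge_duplicates()`, `signature()` and the use of MD5 over "title" and "body" are assumptions for illustration only.

```python
import hashlib


def signature(doc, fields):
    """Hash the chosen fields into a hex digest (a stand-in for the signature field)."""
    joined = "\x1f".join(str(doc.get(f, "")) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def merge_duplicates(docs, sig_fields, ids_field="mv_ID"):
    """Return one doc per signature, carrying the ids of all its duplicates."""
    merged = {}  # signature -> merged doc (dicts keep insertion order)
    for doc in docs:
        sig = signature(doc, sig_fields)
        if sig not in merged:
            first = dict(doc)          # keep the first doc seen as the representative
            first["signature"] = sig
            first[ids_field] = []
            merged[sig] = first
        merged[sig][ids_field].append(doc["id"])
    return list(merged.values())


raw_docs = [
    {"id": "a", "title": "Hello", "body": "World"},
    {"id": "b", "title": "Hello", "body": "World"},
    {"id": "c", "title": "Other", "body": "Text"},
]
to_index = merge_duplicates(raw_docs, ["title", "body"])
```

The merged documents can then be sent to Solr as ordinary adds, with "mv_ID" declared as a multivalued field in the schema.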

-Hoss
