lucene-solr-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Encountering a roadblock with my Solr schema design...use dedupe?
Date Thu, 14 Jan 2010 00:45:27 GMT

: Dedupe is completely the wrong word. Deduping is something else
: entirely - it is about trying not to index the same document twice.

Dedup can also certainly be used with field collapsing -- that was one of 
the initial use cases identified for the SignatureUpdateProcessorFactory 
... you can compute an 'expensive' signature when adding a document, index 
it, and then FieldCollapse on that signature field.

This gives you "query time deduplication" based on a value computed when
indexing. The canonical example is multiple URLs referencing the "same"
content but with slightly different boilerplate markup. You can use a
Signature class that recognizes the boilerplate and computes an identical
signature value for each URL whose content is "the same", but still index
all of the URLs and their content as distinct documents. Use cases where
people only want "distinct" URLs work using field collapse, but by default
all matching documents can still be returned, and searches on text in the
boilerplate markup also still work.
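
To make the idea concrete, here is a rough sketch in plain Python. It does
not use Solr at all: it only mimics the pattern of computing a signature at
index time and collapsing on it at query time. The boilerplate pattern, the
function names (signature, index, search), the example URLs and the choice
of MD5 are all assumptions made up for illustration, not what any Solr
Signature class actually does.

```python
import hashlib
import re

# Hypothetical notion of "boilerplate": anything inside <nav> or <footer> tags.
BOILERPLATE = re.compile(r"<(nav|footer)>.*?</\1>", re.DOTALL)


def signature(content: str) -> str:
    """Hash the content with boilerplate removed and whitespace/case normalized."""
    core = BOILERPLATE.sub("", content)
    core = " ".join(core.split()).lower()
    return hashlib.md5(core.encode("utf-8")).hexdigest()


def index(docs):
    """'Index time': keep every document, but attach a computed signature field."""
    return [dict(doc, signature=signature(doc["content"])) for doc in docs]


def search(indexed, term, collapse=False):
    """'Query time': substring match on the full content, optionally
    collapsing hits so only the first document per signature is returned."""
    hits = [d for d in indexed if term.lower() in d["content"].lower()]
    if not collapse:
        return hits
    seen = set()
    collapsed = []
    for doc in hits:
        if doc["signature"] not in seen:
            seen.add(doc["signature"])
            collapsed.append(doc)
    return collapsed


docs = [
    {"url": "http://a.example/story",
     "content": "<nav>Site A</nav> Solr field collapsing explained"},
    {"url": "http://b.example/story",
     "content": "<nav>Site B</nav> Solr field  collapsing explained"},
    {"url": "http://c.example/other",
     "content": "<nav>Site A</nav> A different article"},
]
indexed = index(docs)
```

With this toy data, search(indexed, "collapsing") returns both story URLs,
search(indexed, "collapsing", collapse=True) returns only one of them, and
search(indexed, "Site B") still finds the second document because the
boilerplate text was indexed along with everything else.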


-Hoss

