lucene-solr-user mailing list archives

From Kelly Taylor <>
Subject Re: Encountering a roadblock with my Solr schema design...use dedupe?
Date Thu, 14 Jan 2010 01:35:47 GMT


Would you suggest using dedup for my use case? If so, do you know of a
working example I can reference?

I don't have an issue using the patched version of Solr, but I'd much rather
use the GA version.


hossman wrote:
> : Dedupe is completely the wrong word. Deduping is something else
> : entirely - it is about trying not to index the same document twice.
> Dedup can also certainly be used with field collapsing -- that was one of 
> the initial use cases identified for the SignatureUpdateProcessorFactory 
> ... you can compute an 'expensive' signature when adding a document, index 
> it, and then FieldCollapse on that signature field.
> This gives you "query time deduplication" based on a value computed when 
> indexing. The canonical example is multiple URLs referencing the "same" 
> content but with slightly different boilerplate markup. You can use a 
> Signature class that recognizes the boilerplate and computes an identical 
> signature value for each URL whose content is "the same" but still index 
> all of the URLs and their content as distinct documents ... so use cases 
> where people only want "distinct" URLs work using field collapse, but by 
> default all matching documents can still be returned, and searches on text 
> in the boilerplate markup also still work.
> -Hoss
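
To make the idea in the quoted reply concrete, here is a minimal Python sketch of "compute a signature at index time, collapse on it at query time". It is not Solr code and does not use any Solr API: the `BOILERPLATE` pattern, the `signature`, `index_docs`, and `search` helpers, and the sample URLs are all hypothetical names made up for illustration, and MD5 over normalized text merely stands in for whatever Signature class you would actually configure.

```python
import hashlib
import re
from collections import OrderedDict

# Hypothetical boilerplate: anything inside <nav> or <footer> tags.
BOILERPLATE = re.compile(r"<(nav|footer)>.*?</\1>", re.DOTALL)


def signature(content):
    """Hash the content with boilerplate removed and whitespace/case normalized."""
    stripped = BOILERPLATE.sub("", content)
    normalized = " ".join(stripped.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def index_docs(docs):
    """Index-time step: keep every document, but attach a signature field."""
    return [dict(doc, signature=signature(doc["content"])) for doc in docs]


def search(indexed, term, collapse=False):
    """Query-time step: substring match, optionally collapsed on the signature."""
    hits = [d for d in indexed if term.lower() in d["content"].lower()]
    if not collapse:
        return hits
    first_per_signature = OrderedDict()
    for doc in hits:
        # Keep only the first hit for each distinct signature value.
        first_per_signature.setdefault(doc["signature"], doc)
    return list(first_per_signature.values())


# Made-up sample data: two URLs share the same body behind different boilerplate.
DOCS = [
    {"url": "http://example.com/a",
     "content": "<nav>Site A menu</nav> Solr field collapsing guide"},
    {"url": "http://example.org/mirror",
     "content": "<nav>Mirror menu</nav> Solr  field collapsing guide"},
    {"url": "http://example.com/b",
     "content": "<nav>Site A menu</nav> Lucene scoring notes"},
]

INDEXED = index_docs(DOCS)

if __name__ == "__main__":
    print(len(search(INDEXED, "solr")))                 # all matching documents
    print(len(search(INDEXED, "solr", collapse=True)))  # one per signature
    print(len(search(INDEXED, "mirror menu")))          # boilerplate text is still searchable
```

Because every URL is indexed as its own document, an uncollapsed search still returns all matches and text in the boilerplate remains searchable; collapsing is purely a query-time choice.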
