lucene-solr-user mailing list archives

From Kelly Taylor <wired...@hotmail.com>
Subject Re: Encountering a roadblock with my Solr schema design...use dedupe?
Date Thu, 14 Jan 2010 01:35:47 GMT

Hoss,

Would you suggest using dedupe for my use case? If so, do you know of a
working example I can reference?

I don't have an issue using the patched version of Solr, but I'd much rather
use the GA version.

-Kelly



hossman wrote:
> 
> 
> : Dedupe is completely the wrong word. Deduping is something else
> : entirely - it is about trying not to index the same document twice.
> 
> Dedup can also certainly be used with field collapsing -- that was one of
> the initial use cases identified for the SignatureUpdateProcessorFactory
> ... you can compute an 'expensive' signature when adding a document, index
> it, and then FieldCollapse on that signature field.
>
> This gives you "query time deduplication" based on a value computed when
> indexing. The canonical example is multiple URLs referencing the "same"
> content but with slightly different boilerplate markup. You can use a
> Signature class that recognizes the boilerplate and computes an identical
> signature value for each URL whose content is "the same", but still index
> all of the URLs and their content as distinct documents. Use cases where
> people only want "distinct" URLs work using field collapse, but by default
> all matching documents can still be returned, and searches on text in the
> boilerplate markup also still work.
> 
> 
> -Hoss
> 
> 
> 

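The idea Hoss describes can be sketched outside Solr. The Python below is only an illustration of the concept, not Solr's implementation or API: the function names (signature, index, search), the tag-stripping rule used as "boilerplate" detection, and the sample field names are all assumptions made for this example. It computes a signature at index time, keeps every document, and collapses on the signature only when asked to at query time.

```python
import hashlib
import re


def signature(content):
    """Index-time step: strip markup, normalize whitespace and case, then hash.

    Stripping tags stands in for "recognizing the boilerplate"; a real
    Signature class would apply its own, more careful rules.
    """
    text = re.sub(r"<[^>]+>", " ", content)
    text = " ".join(text.split()).lower()
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def index(docs):
    """Keep every document as a distinct entry, adding a computed signature field."""
    return [dict(doc, signature=signature(doc["content"])) for doc in docs]


def search(indexed, term, collapse=False):
    """Return documents whose raw content contains term.

    With collapse=True, keep only the first hit per signature value,
    which mimics collapsing on the signature field at query time.
    """
    hits = [d for d in indexed if term.lower() in d["content"].lower()]
    if not collapse:
        return hits
    first_per_signature = {}
    for d in hits:
        first_per_signature.setdefault(d["signature"], d)
    return list(first_per_signature.values())
```

With two URLs whose content differs only in markup, search(indexed, "hello") returns both documents, while search(indexed, "hello", collapse=True) returns one of them. A search for text that appears only in the markup still matches, because the original content is indexed unchanged.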