lucene-solr-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: dealing with duplicates
Date Sat, 01 Aug 2009 06:32:41 GMT
Joe,

Maybe we can take a step back first.  Would it be better if your index was cleaner and didn't
have flagged duplicates in the first place?  If so, have you tried using
http://wiki.apache.org/solr/Deduplication ?

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

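To make the "cleaner index" idea concrete, here is a minimal sketch of signature-based deduplication applied before indexing. It is not Solr's implementation: the function names, the choice of MD5, and the rule of keeping the first document per signature are assumptions made only for illustration. Exact hashing collapses only documents whose chosen fields are identical, so it may not cover clusters whose members have very different text.

```python
import hashlib


def signature(doc, fields):
    """Hash the chosen fields of a document into a hex digest.

    MD5 is used purely for illustration; any stable hash would do.
    """
    h = hashlib.md5()
    for name in fields:
        h.update(str(doc.get(name, "")).encode("utf-8"))
        h.update(b"\x00")  # separator so field boundaries affect the hash
    return h.hexdigest()


def deduplicate(docs, fields):
    """Keep the first document seen for each signature (hypothetical policy)."""
    seen = {}
    for doc in docs:
        seen.setdefault(signature(doc, fields), doc)
    return list(seen.values())
```

Calling deduplicate(docs, ["description", "tags", "meta"]) on a list of dicts would return one document per distinct combination of those three field values.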


----- Original Message ----
> From: Joe Calderon <calderon.joe@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Friday, July 31, 2009 5:06:48 PM
> Subject: dealing with duplicates
> 
> hello all, i have a collection of a few million documents; i have many
> duplicates in this collection. they have been clustered with a simple
> algorithm. i have a field called 'duplicate' which is 0 or 1, and
> fields called 'description', 'tags', and 'meta'. documents are clustered on
> different criteria, and the text i search against could be very
> different among members of a cluster.
> 
> im currently using a dismax handler to search across the text fields
> with different boosts, and a filter query to restrict to masters
> (duplicate: 0)
> 
> my question is then, how do i best query for documents which are
> masters OR match text but are not included in the matched set of
> masters?
> 
> does this make sense?

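For reference, the setup described in the quoted message (a dismax search over the text fields, restricted to masters by a filter query) could be expressed as request parameters roughly as follows. This is a sketch under assumptions: the base URL, the boost values, and selecting dismax through the defType parameter are all hypothetical, since the original message does not give them. It only restates the current query; it does not answer the question of how to also surface non-master matches.

```python
from urllib.parse import urlencode


def build_master_query(user_text, base_url="http://localhost:8983/solr/select"):
    """Build a select URL for a dismax search limited to master documents.

    base_url and the boosts in qf are placeholders, not values from the thread.
    """
    params = [
        ("defType", "dismax"),                        # assumed way of choosing dismax
        ("q", user_text),                             # the user's search text
        ("qf", "description^2.0 tags^1.5 meta^1.0"),  # hypothetical field boosts
        ("fq", "duplicate:0"),                        # filter query: masters only
    ]
    return base_url + "?" + urlencode(params)
```

Because the restriction is in fq rather than q, it narrows the result set without affecting the relevance score.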
