lucene-solr-user mailing list archives

From Otis Gospodnetic <>
Subject Re: dealing with duplicates
Date Sat, 01 Aug 2009 06:32:41 GMT

Maybe we can take a step back first. Would it be better if your index were cleaner and didn't
have flagged duplicates in the first place? If so, have you tried using


----- Original Message ----
> From: Joe Calderon <>
> To:
> Sent: Friday, July 31, 2009 5:06:48 PM
> Subject: dealing with duplicates
> Hello all, I have a collection of a few million documents, and it
> contains many duplicates. They have been clustered with a simple
> algorithm. I have a field called 'duplicate', which is 0 or 1, and
> text fields called 'description', 'tags', and 'meta'. Documents are
> clustered on different criteria, and the text I search against could
> be very different among members of a cluster.
> I'm currently using a dismax handler to search across the text fields
> with different boosts, and a filter query to restrict results to
> masters (duplicate:0).
> My question is then: how do I best query for documents which are
> masters OR which match the text but are not included in the matched
> set of masters?
> Does this make sense?
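
For reference, the setup described in the quoted message (a dismax search over boosted text fields plus a filter query on masters) might look roughly like the sketch below. This is only an illustration: the endpoint URL, the function name `build_master_query`, and the boost values are assumptions, not details from the original post. The request parameters `q`, `defType`, `qf`, and `fq` are standard Solr parameters.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; adjust host, port, and path to your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_master_query(user_text, boosts=None):
    """Return a request URL for a dismax search restricted to masters."""
    # Boost values are illustrative assumptions, not from the original post.
    if boosts is None:
        boosts = {"description": 2.0, "tags": 1.5, "meta": 1.0}

    # dismax expects "field^boost" pairs separated by spaces in qf.
    qf = " ".join(f"{field}^{boost}" for field, boost in boosts.items())

    params = {
        "q": user_text,       # the user's search text
        "defType": "dismax",  # select the dismax query parser
        "qf": qf,             # fields to search, with per-field boosts
        "fq": "duplicate:0",  # filter query: masters only
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_master_query("red shoes"))
```

Because `fq` filters the result set without affecting scoring, this form only ever returns masters. Duplicates whose text matches while their master's text does not are excluded, which is the gap the question is asking about.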
