lucene-solr-user mailing list archives

From Joe Calderon <calderon....@gmail.com>
Subject Re: dealing with duplicates
Date Sat, 01 Aug 2009 16:15:32 GMT
hello, thanks for the response, i did take a look at that document but
in my application i actually want the duplicates, as i mentioned, the
matching text could be very different among cluster members, what
joins them together is a similar set of numeric features.

currently i do a query with fq=duplicate:0 and show a link to
optionally show the "dupes" by querying for all dupes of the
master id. however, i'm currently missing any documents that matched the
query but are duplicates of other masters not included in that result
set.
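
a minimal sketch of the two requests described above, written as query-string builders in python. the field name "master_id" is an assumption (the thread never names the field that links a dupe to its master), and handler selection is left out on purpose:

```python
from urllib.parse import urlencode


def masters_query(text):
    # First request: search the text, restricted to masters by a filter query.
    return urlencode({"q": text, "fq": "duplicate:0"})


def dupes_query(master_id):
    # Second request: every flagged duplicate of one master.
    # "master_id" is a hypothetical field name, not taken from the thread.
    params = {"q": "*:*", "fq": ["duplicate:1", f"master_id:{master_id}"]}
    return urlencode(params, doseq=True)


if __name__ == "__main__":
    print(masters_query("red shoes"))
    print(dupes_query("42"))
```

the gap is visible here: a dupe that matches the text but whose master does not match is never returned by the first request, so the second request is never issued for it.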

in a relational database (fulltext indexing aside) i would use a
subquery. i imagine a similar approach could be used with lucene, i
just don't know the syntax.
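
one possible workaround, sketched in python, is to drop the duplicate filter and group the ranked hits by cluster on the client. this is client-side grouping, not a lucene subquery, and it assumes each dupe carries a hypothetical "master_id" field pointing at its master:

```python
from collections import OrderedDict


def collapse_by_master(hits):
    """Group ranked hits by cluster, keeping clusters in first-seen order."""
    clusters = OrderedDict()
    for doc in hits:
        # A master stands for its own cluster; a dupe points at its master.
        # "master_id" is a hypothetical field name, not taken from the thread.
        key = doc["id"] if doc["duplicate"] == 0 else doc["master_id"]
        clusters.setdefault(key, []).append(doc["id"])
    return clusters


if __name__ == "__main__":
    hits = [
        {"id": "a", "duplicate": 0},
        {"id": "b", "duplicate": 1, "master_id": "a"},
        {"id": "c", "duplicate": 1, "master_id": "z"},  # master "z" did not match
    ]
    for master, members in collapse_by_master(hits).items():
        print(master, members)
```

a cluster whose master did not match the query still shows up, keyed by that master's id, which is the case the filtered query misses. the cost is that the grouping only sees the hits that were actually fetched.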

best,

--joe

On Fri, Jul 31, 2009 at 11:32 PM, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
> Joe,
>
> Maybe we can take a step back first.  Would it be better if your index was cleaner and
> didn't have flagged duplicates in the first place?  If so, have you tried using
> http://wiki.apache.org/solr/Deduplication ?
>
>  Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>
>
>
> ----- Original Message ----
>> From: Joe Calderon <calderon.joe@gmail.com>
>> To: solr-user@lucene.apache.org
>> Sent: Friday, July 31, 2009 5:06:48 PM
>> Subject: dealing with duplicates
>>
>> hello all, i have a collection of a few million documents; i have many
>> duplicates in this collection. they have been clustered with a simple
>> algorithm, i have a field called 'duplicate' which is 0 or 1 and
>> fields called 'description', 'tags', 'meta'. documents are clustered on
>> different criteria and the text i search against could be very
>> different among members of a cluster.
>>
>> im currently using a dismax handler to search across the text fields
>> with different boosts, and a filter query to restrict to masters
>> (duplicate: 0)
>>
>> my question is then, how do i best query for documents which are
>> masters OR match text but are not included in the matched set of
>> masters?
>>
>> does this make sense?
>
>
