lucene-solr-user mailing list archives

From Joe Calderon <>
Subject Re: dealing with duplicates
Date Mon, 10 Aug 2009 19:59:04 GMT
so in case someone can help me with the query syntax, the
relational query i would use for this would be something like:

SELECT * FROM videos
WHERE title LIKE 'family guy'
  AND desc LIKE 'stewie%'
  AND (
    ( is_dup = 0 )
    OR
    ( is_dup = 1 AND id NOT IN (
        SELECT id FROM videos
        WHERE title LIKE 'family guy'
          AND desc LIKE 'stewie%'
          AND is_dup = 0 ) )
  )
ORDER BY views

can a similar query be written in lucene or do i need to structure my
index differently to be able to do such a query?
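
to make the intended semantics concrete, here is a rough python sketch over an in-memory list rather than a lucene index. it assumes a hypothetical master_id field linking each duplicate to its master (the sql above only uses id), and the sample rows and the matches() helper are made up for illustration, so treat it as a sketch of the logic and not as a lucene or solr query:

```python
# Hypothetical in-memory stand-in for the index. The master_id field and
# the sample rows are assumptions for illustration only.
videos = [
    {"id": 1, "master_id": 1, "is_dup": 0, "title": "family guy",
     "desc": "stewie builds a time machine", "views": 50},
    {"id": 2, "master_id": 1, "is_dup": 1, "title": "family guy",
     "desc": "stewie clip reupload", "views": 10},
    {"id": 3, "master_id": 4, "is_dup": 1, "title": "family guy",
     "desc": "stewie vs brian", "views": 30},
    {"id": 4, "master_id": 4, "is_dup": 0, "title": "family guy",
     "desc": "brian and the time machine", "views": 70},
]


def matches(doc):
    """Crude stand-in for the text match (title = 'family guy', desc starts with 'stewie')."""
    return doc["title"] == "family guy" and doc["desc"].startswith("stewie")


def search(docs):
    # Pass 1: every document that matches the text criteria.
    hits = [d for d in docs if matches(d)]
    # Ids of the masters that matched.
    matched_masters = {d["id"] for d in hits if d["is_dup"] == 0}
    # Pass 2: keep matching masters, plus matching duplicates whose
    # master did not itself match.
    keep = [
        d for d in hits
        if d["is_dup"] == 0 or d["master_id"] not in matched_masters
    ]
    # Ascending sort, mirroring a plain ORDER BY views.
    return sorted(keep, key=lambda d: d["views"])


result_ids = [d["id"] for d in search(videos)]
print(result_ids)
```

with these rows, document 2 is dropped because its master (1) already matched, while document 3 is kept because its master (4) did not match the text.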

thx much


On Sat, Aug 1, 2009 at 9:15 AM, Joe Calderon<> wrote:
> hello, thanks for the response, i did take a look at that document but
> in my application i actually want the duplicates, as i mentioned, the
> matching text could be very different among cluster members, what
> joins them together is a similar set of numeric features.
> currently i do a query with fq=duplicate:0 and show a link to
> optionally show the "dupes" by querying for all dupes of the
> master id, however im currently missing any documents that matched the
> query but are duplicates of other masters not included in that result
> set.
> in a relational database (fulltext indexing aside) i would use a
> subquery, i imagine a similar approach could be used with lucene, i
> just dont know the syntax
> best,
> --joe
> On Fri, Jul 31, 2009 at 11:32 PM, Otis
> Gospodnetic<> wrote:
>> Joe,
>> Maybe we can take a step back first.  Would it be better if your index was cleaner
>> and didn't have flagged duplicates in the first place?  If so, have you tried using
>>  Otis
>> --
>> Sematext is hiring --
>> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>> ----- Original Message ----
>>> From: Joe Calderon <>
>>> To:
>>> Sent: Friday, July 31, 2009 5:06:48 PM
>>> Subject: dealing with duplicates
>>> hello all, i have a collection of a few million documents; i have many
>>> duplicates in this collection. they have been clustered with a simple
>>> algorithm, i have a field called 'duplicate' which is 0 or 1 and
>>> fields called 'description', 'tags', 'meta', documents are clustered on
>>> different criteria and the text i search against could be very
>>> different among members of a cluster.
>>> im currently using a dismax handler to search across the text fields
>>> with different boosts, and a filter query to restrict to masters
>>> (duplicate: 0)
>>> my question is then, how do i best query for documents which are
>>> masters OR match text but are not included in the matched set of
>>> masters?
>>> does this make sense?
