lucene-solr-user mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: How to query for similar documents before indexing
Date Mon, 10 May 2010 23:58:27 GMT
There is no official support for dedupe at search time. You can take a 
look at the field collapse patch in JIRA though. We were thinking 
ahead when we added the ability to tag dupes during indexing for field 
collapsing at search time, but the search-side support is not there yet.
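
Until search-side support exists, one workaround is to collapse duplicates
on the client after the results come back. The following is only a minimal
sketch of that idea, not Solr code: it assumes each result is a plain dict
and that the signature written at index time is returned in a stored field
named "signature" (both the function name and the field name are
hypothetical).

```python
from typing import Any, Dict, Iterable, List


def collapse_by_signature(
    docs: Iterable[Dict[str, Any]], field: str = "signature"
) -> List[Dict[str, Any]]:
    """Keep the first document seen for each signature value.

    `field` is an assumed name for the stored signature field.
    Documents without that field are passed through unchanged.
    """
    seen = set()
    collapsed = []
    for doc in docs:
        sig = doc.get(field)
        if sig is None:
            # No signature to compare on, so keep the document.
            collapsed.append(doc)
            continue
        if sig in seen:
            # A higher-ranked document with the same signature was kept.
            continue
        seen.add(sig)
        collapsed.append(doc)
    return collapsed
```

Because the first hit per signature wins, the result order (and therefore
the ranking) of the surviving documents is preserved. The signature field
has to be stored for this to work, which is the index-size cost raised in
the question below.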

On 5/10/10 7:39 PM, Ken Krugler wrote:
> Hi all (especially Yonik),
>
> At the http://wiki.apache.org/solr/Deduplication page, it mentions
> "duplicate field collapsing" and later "Allow for both duplicate
> collapsing in search results..."
>
> But I don't see any mention of how deduplication happens during search
> time. Normally this requires that the field be stored (not just
> indexed), and for efficiency it might need to be in a FieldCache. I'm
> wondering about both status of this support, and thoughts on potential
> impact to index/memory size.
>
> Thanks,
>
> -- Ken
>
>
> On May 10, 2010, at 3:07pm, Markus Jelsma wrote:
>
>> Hi Matthieu,
>>
>> On the top of the wiki page you can see it's in 1.4 already. As far as
>> I know the API doesn't return information on found duplicates in its
>> response header; the wiki isn't clear on that subject. I, at least,
>> never saw any response other than an error or the usual status code
>> and QTime.
>>
>> Perhaps it would be a nice feature. On the other hand, you can also
>> have a manual process that finds duplicates based on that signature
>> and gather that information yourself as long as such a feature isn't
>> there.
>>
>> Cheers,
>>
>> -----Original message-----
>> From: Matthieu Labour <matthieu_labour@yahoo.com>
>> Sent: Mon 10-05-2010 23:30
>> To: solr-user@lucene.apache.org;
>> Subject: RE: How to query for similar documents before indexing
>>
>> Markus
>> Thank you for your response
>> That would be great if the index has the option to prevent duplicates
>> from entering the index. But is it going to be a silent action? Or
>> will the add method return that it failed indexing because it detected
>> a duplicate?
>> Is it committed to 1.4 already?
>> Cheers
>> matt
>>
>>
>> --- On Mon, 5/10/10, Markus Jelsma <markus.jelsma@buyways.nl> wrote:
>>
>> From: Markus Jelsma <markus.jelsma@buyways.nl>
>> Subject: RE: How to query for similar documents before indexing
>> To: solr-user@lucene.apache.org
>> Date: Monday, May 10, 2010, 4:11 PM
>>
>> Hi,
>>
>> Deduplication [1] is what you're looking for. It can utilize different
>> analyzers that will add one or more signatures or hashes to your
>> document depending on exact or partial matches for configurable
>> fields. Based on that, it should be able to prevent new documents from
>> entering the index.
>>
>> The first part works very well, but I have some issues with removing
>> those documents, which I also need to check with the community
>> tomorrow back at work ;-)
>>
>>
>> [1]: http://wiki.apache.org/solr/Deduplication
>>
>> Cheers,
>>
>>
>>
>> -----Original message-----
>> From: Matthieu Labour <matthieu_labour@yahoo.com>
>> Sent: Mon 10-05-2010 22:41
>> To: solr-user@lucene.apache.org;
>> Subject: How to query for similar documents before indexing
>>
>> Hi
>>
>> I want to implement the following logic:
>>
>> Before I index a new document into the index, I want to check if there
>> are already documents in the index with similar content to the content
>> of the document about to be inserted. If the request returns 1 or more
>> documents, then I don't want to insert the document.
>>
>> What is the best way to achieve the above functionality ?
>>
>> I read about Fuzzy searches in logic. But can I really build a request
>> such as
>> mydoc.title:wordexample~ AND mydoc.content:( all the content words)~0.9 ?
>>
>> Thank you for your help
>>
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c w e b m i n i n g
>
>
>
>
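
For the check-before-insert logic asked about in the original question, the
signature idea described in the thread can be sketched as follows. This is
an illustration under stated assumptions, not how Solr implements it: the
"index" is an in-memory dict, the signature is an MD5 hash over
whitespace- and case-normalized field values, and the names
compute_signature and DedupingIndex are hypothetical. An exact hash like
this only catches documents that are identical after normalization; truly
"similar" content would need a fuzzy signature instead.

```python
import hashlib
from typing import Any, Dict, Iterable


def compute_signature(doc: Dict[str, Any], fields: Iterable[str]) -> str:
    """Hash the chosen fields into one hex signature (assumed scheme)."""
    h = hashlib.md5()
    for name in fields:
        # Lowercase and collapse whitespace so trivial differences
        # do not produce a different signature.
        value = " ".join(str(doc.get(name, "")).lower().split())
        h.update(name.encode("utf-8"))
        h.update(b"\x00")  # separator between field name and value
        h.update(value.encode("utf-8"))
        h.update(b"\x00")  # separator between fields
    return h.hexdigest()


class DedupingIndex:
    """Toy in-memory index that refuses documents with a known signature."""

    def __init__(self, fields: Iterable[str]) -> None:
        self.fields = list(fields)
        self.by_signature: Dict[str, Dict[str, Any]] = {}

    def add(self, doc: Dict[str, Any]) -> bool:
        """Return True if the document was added, False if it is a duplicate."""
        sig = compute_signature(doc, self.fields)
        if sig in self.by_signature:
            return False
        # Store the signature alongside the document, like a tagged field.
        self.by_signature[sig] = dict(doc, signature=sig)
        return True
```

Unlike the silent behavior discussed above, this add method reports
explicitly whether the document was rejected, which is the feedback the
original poster was asking for.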


-- 
- Mark

http://www.lucidimagination.com
