lucene-solr-user mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: How to query for similar documents before indexing
Date Mon, 10 May 2010 23:39:34 GMT
Hi all (especially Yonik),

At the http://wiki.apache.org/solr/Deduplication page, it mentions  
"duplicate field collapsing" and later "Allow for both duplicate  
collapsing in search results..."

But I don't see any mention of how deduplication happens at search
time. Normally this requires that the field be stored (not just
indexed), and for efficiency it might need to be in a FieldCache. I'm
wondering about the status of this support, and would welcome thoughts
on the potential impact on index/memory size.
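
As a rough illustration of collapsing duplicates at search time, here is a minimal client-side sketch in Python. It is not Solr's implementation. It assumes each hit comes back with a stored field holding the deduplication signature; the field name `signature` and the helper `collapse_by_signature` are hypothetical.

```python
from typing import Dict, Iterable, List


def collapse_by_signature(docs: Iterable[Dict], field: str = "signature") -> List[Dict]:
    """Keep only the first (highest-ranked) hit for each signature value.

    Assumption: `field` is a stored field returned with every hit.
    Hits that lack the field are passed through and never collapsed.
    """
    seen = set()
    collapsed = []
    for doc in docs:
        sig = doc.get(field)
        if sig is not None:
            if sig in seen:
                continue  # a higher-ranked hit already carries this signature
            seen.add(sig)
        collapsed.append(doc)
    return collapsed


# Hypothetical hits, already ordered by relevance.
hits = [
    {"id": "a", "signature": "x1"},
    {"id": "b", "signature": "x1"},
    {"id": "c", "signature": "y2"},
]
unique_ids = [d["id"] for d in collapse_by_signature(hits)]
```

Collapsing on the client like this only works if the signature is returned with each hit, which is why the stored-versus-indexed question matters for index size.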

Thanks,

-- Ken


On May 10, 2010, at 3:07pm, Markus Jelsma wrote:

> Hi Matthieu,
>
> At the top of the wiki page you can see it's in 1.4 already. As far
> as I know, the API doesn't return information on found duplicates in
> its response header, and the wiki isn't clear on that subject. I, at
> least, never saw any response other than an error or the usual
> status code and QTime.
>
> Perhaps it would be a nice feature. On the other hand, until such a
> feature exists, you can also run a manual process that finds
> duplicates based on that signature and gather that information
> yourself.
>
> Cheers,
>
> -----Original message-----
> From: Matthieu Labour <matthieu_labour@yahoo.com>
> Sent: Mon 10-05-2010 23:30
> To: solr-user@lucene.apache.org;
> Subject: RE: How to query for similar documents before indexing
>
> Markus
> Thank you for your response
> It would be great if the index had the option to prevent duplicates
> from entering the index. But is it going to be a silent action? Or
> will the add method return that it failed indexing because it
> detected a duplicate?
> Is it committed to 1.4 already?
> Cheers
> matt
>
>
> --- On Mon, 5/10/10, Markus Jelsma <markus.jelsma@buyways.nl> wrote:
>
> From: Markus Jelsma <markus.jelsma@buyways.nl>
> Subject: RE: How to query for similar documents before indexing
> To: solr-user@lucene.apache.org
> Date: Monday, May 10, 2010, 4:11 PM
>
> Hi,
>
> Deduplication [1] is what you're looking for. It can utilize
> different analyzers that will add one or more signatures or hashes
> to your document depending on exact or partial matches for
> configurable fields. Based on that, it should be able to prevent new
> documents from entering the index.
>
> The first part works very well, but I have some issues with removing
> those documents, which I also need to check with the community
> tomorrow back at work ;-)
>
>
> [1]: http://wiki.apache.org/solr/Deduplication
>
> Cheers,
>
>
>
> -----Original message-----
> From: Matthieu Labour <matthieu_labour@yahoo.com>
> Sent: Mon 10-05-2010 22:41
> To: solr-user@lucene.apache.org;
> Subject: How to query for similar documents before indexing
>
> Hi
>
> I want to implement the following logic:
>
> Before I index a new document into the index, I want to check if  
> there are already documents in the index with similar content to the  
> content of the document about to be inserted. If the request returns  
> 1 or more documents, then I don't want to insert the document.
>
> What is the best way to achieve the above functionality?
>
> I read about Fuzzy searches in logic. But can I really build a  
> request such as
> mydoc.title:wordexample~ AND mydoc.content:( all the content  
> words)~0.9 ?
>
> Thank you for your help
>
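
For reference, the index-time approach described in the quoted thread (compute a signature over configurable fields, then refuse documents whose signature is already present) can be sketched roughly as follows. This is a toy in-memory model in Python, not Solr's code: the names `signature` and `DedupIndex` are hypothetical, MD5 is just one possible hash, and an exact hash only catches exact matches, so near-duplicate detection would need a fuzzy signature instead. Returning a boolean from `add` is a choice made for this sketch only; as noted in the thread, Solr 1.4 does not appear to report detected duplicates in its response.

```python
import hashlib
from typing import Dict, List


def signature(doc: Dict, fields: List[str]) -> str:
    """Hash the configured fields of a document into one hex digest."""
    h = hashlib.md5()
    for name in fields:
        h.update(str(doc.get(name, "")).encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()


class DedupIndex:
    """Toy in-memory index that rejects documents with a known signature."""

    def __init__(self, fields: List[str]) -> None:
        self.fields = fields
        self.by_signature: Dict[str, Dict] = {}

    def add(self, doc: Dict) -> bool:
        """Return True if the document was added, False if it was a duplicate."""
        sig = signature(doc, self.fields)
        if sig in self.by_signature:
            return False
        self.by_signature[sig] = doc
        return True


index = DedupIndex(fields=["title", "content"])
first = index.add({"title": "Hello", "content": "Some text"})
second = index.add({"title": "Hello", "content": "Some text"})
third = index.add({"title": "Hello", "content": "Other text"})
```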

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




