lucene-solr-user mailing list archives

From Markus Jelsma <markus.jel...@buyways.nl>
Subject RE: How to query for similar documents before indexing
Date Mon, 10 May 2010 22:07:14 GMT
Hi Matthieu,

At the top of the wiki page you can see it's in 1.4 already. As far as I know, the API doesn't
return information about detected duplicates in its response header; the wiki isn't clear on
that point. I, at least, have never seen any response other than an error or the usual status
code and QTime.

 

Perhaps it would be a nice feature. On the other hand, as long as such a feature isn't there,
you can run a manual process that finds duplicates based on that signature and gathers the
information yourself.
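
A minimal sketch of such a manual process, in Python. It assumes the documents have already
been fetched from the index as dictionaries carrying an "id" and a signature field; the field
name "signature" and the function name find_duplicates are made up for illustration and
depend on how deduplication is configured.

```python
from collections import defaultdict


def find_duplicates(docs, signature_field="signature"):
    """Group document ids by signature and keep only groups with more than one id.

    `docs` is an iterable of dicts, each with an "id" key and (usually) a
    signature field. The field name is an assumption, not a Solr default.
    """
    groups = defaultdict(list)
    for doc in docs:
        sig = doc.get(signature_field)
        if sig is not None:  # skip documents that carry no signature
            groups[sig].append(doc["id"])
    # A signature shared by two or more ids marks a set of duplicates.
    return {sig: ids for sig, ids in groups.items() if len(ids) > 1}


if __name__ == "__main__":
    sample = [
        {"id": "1", "signature": "a"},
        {"id": "2", "signature": "b"},
        {"id": "3", "signature": "a"},
    ]
    print(find_duplicates(sample))
```

For a large index it is probably cheaper to let Solr do the grouping, for example by faceting
on the signature field with facet.mincount=2, and to fetch only the ids for the signatures
that come back.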

 

 

Cheers,


 
-----Original message-----
From: Matthieu Labour <matthieu_labour@yahoo.com>
Sent: Mon 10-05-2010 23:30
To: solr-user@lucene.apache.org; 
Subject: RE: How to query for similar documents before indexing

Markus
Thank you for your response
It would be great if the index had an option to prevent duplicates from entering it.
But is it going to be a silent action? Or will the add method report that indexing failed
because it detected a duplicate?
Is it already committed to 1.4?
Cheers
matt


--- On Mon, 5/10/10, Markus Jelsma <markus.jelsma@buyways.nl> wrote:

From: Markus Jelsma <markus.jelsma@buyways.nl>
Subject: RE: How to query for similar documents before indexing
To: solr-user@lucene.apache.org
Date: Monday, May 10, 2010, 4:11 PM

Hi,

 

 

Deduplication [1] is what you're looking for. It can utilize different analyzers that will
add one or more signatures or hashes to your document, depending on exact or partial matches
for configurable fields. Based on that, it should be able to prevent new documents from entering
the index.
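
To illustrate the idea only (this is not Solr's implementation), here is a small Python
sketch of an exact-match signature: a hash over a configurable list of fields, checked before
a document is accepted. The names exact_signature and DedupIndex are hypothetical, and MD5 is
just one possible choice of hash.

```python
import hashlib


def exact_signature(doc, fields):
    """Return an MD5 hex digest over the chosen fields, in the given order."""
    h = hashlib.md5()
    for name in fields:
        # Separate field names and values with NUL bytes so that
        # ("ab", "c") and ("a", "bc") do not collide.
        h.update(name.encode("utf-8"))
        h.update(b"\0")
        h.update(str(doc.get(name, "")).encode("utf-8"))
        h.update(b"\0")
    return h.hexdigest()


class DedupIndex:
    """Toy in-memory index that rejects documents with a known signature."""

    def __init__(self, fields):
        self.fields = fields
        self.by_signature = {}

    def add(self, doc):
        """Store `doc` and return True, or return False if it is a duplicate."""
        sig = exact_signature(doc, self.fields)
        if sig in self.by_signature:
            return False
        self.by_signature[sig] = doc
        return True


if __name__ == "__main__":
    index = DedupIndex(["title", "content"])
    print(index.add({"id": "1", "title": "Hello", "content": "World"}))
    print(index.add({"id": "2", "title": "Hello", "content": "World"}))
```

An exact hash like this only catches identical field values; catching near-duplicates
("partial matches") needs a fuzzier signature.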

 

The first part works very well, but I have some issues with removing those documents, which
I also need to check with the community tomorrow, back at work ;-)

 

 

[1]: http://wiki.apache.org/solr/Deduplication


 

Cheers,


 
-----Original message-----
From: Matthieu Labour <matthieu_labour@yahoo.com>
Sent: Mon 10-05-2010 22:41
To: solr-user@lucene.apache.org; 
Subject: How to query for similar documents before indexing

Hi

I want to implement the following logic:

Before I index a new document into the index, I want to check if there are already documents
in the index with similar content to the content of the document about to be inserted. If
the request returns 1 or more documents, then I don't want to insert the document.

What is the best way to achieve the above functionality ?

I read about Fuzzy searches in logic. But can I really build a request such as 
mydoc.title:wordexample~ AND mydoc.content:( all the content words)~0.9 ?

Thank you for your help




     
 



      