lucene-solr-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: Document Similarity Algorithm at Solr/Lucene
Date Tue, 23 Jul 2013 13:52:03 GMT
One classic approach is to simply use the full text of the suspect text, as
well as bigrams and trigrams (phrases) from that text, as the query, with all
of the terms and phrases joined by "OR" operators. The top results will be
the documents that most closely "match" the suspect text. That gives you a
set of similar results that you can inspect visually. You will then have to
apply some heuristic of your own to decide how many top results to look at
or what score to cut off at. The use of "OR" operators assures that similar
documents will be found even if they do not contain 100% of the words. Yes,
"OR" guarantees that your total result count will be high, but scoring
assures that the most relevant documents rank at the top.
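
The approach above can be sketched roughly as follows. This is only an
illustration, not a tested Solr client: the field name "content", the
helper names (ngrams, build_or_query, keep_near_top) and the 0.5 cutoff
ratio are all assumptions made for the example. It builds the query string
from single words, bigrams, and trigrams, and shows one possible cutoff
heuristic relative to the top score.

```python
import re


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_or_query(text, field="content", max_n=3):
    """Build a Lucene-style query string of words and phrases joined by OR.

    'field' is a hypothetical field name; use whatever field holds the
    document body in your schema.
    """
    # Keep only word characters so no query-syntax characters leak in.
    # Lowercasing also keeps words like "or" from being read as operators.
    tokens = re.findall(r"\w+", text.lower())
    clauses = []
    seen = set()
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            if gram in seen:
                continue  # skip repeated words and phrases
            seen.add(gram)
            if n == 1:
                clauses.append(f"{field}:{gram}")
            else:
                # Multi-word n-grams become quoted phrase queries.
                clauses.append(f'{field}:"{gram}"')
    return " OR ".join(clauses)


def keep_near_top(results, ratio=0.5):
    """One possible cutoff heuristic (an assumption, tune it yourself):
    keep (doc_id, score) pairs scoring at least ratio * the top score."""
    if not results:
        return []
    top = max(score for _, score in results)
    return [(doc_id, score) for doc_id, score in results if score >= ratio * top]


if __name__ == "__main__":
    print(build_or_query("the quick fox"))
    print(keep_near_top([("a", 10.0), ("b", 6.0), ("c", 2.0)]))
```

For a long document this produces a very large number of clauses, so in
practice you would probably need to sample or cap the n-grams, or raise
the maxBooleanClauses limit in solrconfig.xml.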

-- Jack Krupansky

-----Original Message----- 
From: Furkan KAMACI
Sent: Tuesday, July 23, 2013 6:16 AM
To: solr-user@lucene.apache.org
Subject: Re: Document Similarity Algorithm at Solr/Lucene

Actually I need a specialized algorithm. I want to use that algorithm to
detect duplicate blog posts.

2013/7/23 Tommaso Teofili <tommaso.teofili@gmail.com>

> Hi,
>
> You may leverage and/or improve the MLT (MoreLikeThis) component [1].
>
> HTH,
> Tommaso
>
> [1] : http://wiki.apache.org/solr/MoreLikeThis
>
>
> 2013/7/23 Furkan KAMACI <furkankamaci@gmail.com>
>
> > Hi;
> >
> > Sometimes a large part of a document may exist in another document, as
> > in student plagiarism or the quotation of one blog post in another.
> > Do Solr/Lucene or their libraries (UIMA, OpenNLP, etc.) have any class
> > to detect it?
> >
> 

