lucene-solr-user mailing list archives

From Otis Gospodnetic
Subject Re: Document Similarity Algorithm at Solr/Lucene
Date Wed, 24 Jul 2013 17:41:19 GMT
Note that this can lead to performance issues.  Queries with lots of
hits require lots of scoring, and this will make queries slower.
We had this case with a client about 2 weeks ago.  We were able to
spot it as a change in the average number of hits before/after query
changes (meant to help with relevance) using Sematext Search
Analytics, which clearly showed a spike in the number of queries with
lots of hits (very "loose" queries), and we were able to correlate
that with performance issues.

So be careful about expanding your queries too much without watching
before/after performance.


On Tue, Jul 23, 2013 at 3:52 PM, Jack Krupansky wrote:
> One classic approach is to simply use the full text of the suspect text as
> well as bigrams and trigrams (phrases) from that text with "OR" operators.
> The top results will be the documents that most closely "match" the subject
> text. That provides a visual set of similar results. You will then have to
> apply some heuristic of your own as far as how many top results to look at
> or what score to cut off at. The use of "OR" operators assures that similar
> documents will be found even if not 100% of the words are used. Yes, "OR"
> guarantees that your total result count will be high, but scoring assures
> that the top results will be more relevant.
> -- Jack Krupansky
> -----Original Message----- From: Furkan KAMACI
> Sent: Tuesday, July 23, 2013 6:16 AM
> To:
> Subject: Re: Document Similarity Algorithm at Solr/Lucene
> Actually I need a specialized algorithm. I want to use that algorithm to
> detect duplicate blog posts.
> 2013/7/23 Tommaso Teofili
>> Hi,
>> You may leverage and/or improve the MLT component [1].
>> HTH,
>> Tommaso
>> [1] :
>> 2013/7/23 Furkan KAMACI
>> > Hi;
>> >
>> > Sometimes a huge part of a document may exist in another document, as
>> > in student plagiarism or the quotation of one blog post in another.
>> > Does Solr/Lucene or its libraries (UIMA, OpenNLP, etc.) have any class to
>> > detect it?
>> >
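
The approach Jack describes above (the words of the suspect text plus
its bigrams and trigrams, all joined with "OR") can be sketched roughly
as follows. This is only an illustration, not code from anyone in the
thread: the field name "content", the helper names, and the clause cap
are assumptions made for the example. The cap is there because of the
performance concern raised at the top of this message; 1024 is, as far
as I know, the default boolean clause limit in Lucene releases of this
era, so check your own configuration.

```python
import re


def shingles(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_or_query(text, field="content", max_clauses=1024):
    """Build a Lucene-syntax OR query from words, bigrams and trigrams.

    `field` and `max_clauses` are hypothetical parameters for this sketch.
    """
    # Keep lowercased word tokens only, so no query-syntax characters and
    # no uppercase operators (AND, OR, NOT) end up inside the query.
    tokens = re.findall(r"\w+", text.lower())

    clauses = []
    seen = set()
    # Emit the longest phrases first so they are the ones kept if the
    # cap truncates the list (a design choice of this sketch).
    for n in (3, 2, 1):
        for term in shingles(tokens, n):
            if term in seen:
                continue
            seen.add(term)
            # Multi-word shingles become quoted phrase clauses.
            clauses.append('"%s"' % term if n > 1 else term)

    clauses = clauses[:max_clauses]
    if not clauses:
        return ""
    return "%s:(%s)" % (field, " OR ".join(clauses))


print(build_or_query("the quick brown fox"))
```

The resulting string would go in the q parameter of a normal select
request. Lowering max_clauses is one simple way to keep such a query
from becoming too "loose", at the cost of matching fewer partial
overlaps.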
