lucene-solr-user mailing list archives

From Upayavira ...@odoko.co.uk>
Subject Re: Find documents that are composed of % words
Date Thu, 10 Oct 2013 11:49:34 GMT


On Wed, Oct 9, 2013, at 02:45 PM, shahzad73 wrote:
> My client has a strange requirement. He will give a list of 500 words
> and then set a percentage like 80%. Now he wants to find those pages or
> documents that consist 80% of words from that list, with only 20%
> unknown words.
> For example, we have this document:
>
>              word1 word2 word3 word4
>
> and he gives the list word1 word2 word3 and sets the accuracy to 75%.
> The above doc will meet the criteria because, first, it matches all the
> words, and second, only 25% of its words are unknown (not in the search
> list).
>
> Here is another way to say this: "if 500 words are provided in the
> search, then all 500 words must exist in the document, and unknown words
> should be only 20% if accuracy is 80%."

As best as I can see, Solr can't quite do this, at least without
enhancement.

There are two parts to how Solr works. The first is boolean querying, in
which a document either matches or doesn't. Here you work out how to
select the documents you are interested in.

The second part is scoring, which involves calculating a score for all
of the documents that have got through the previous round.

It seems the boolean portion could be achieved by setting
minimum-should-match (the mm parameter of the dismax/edismax query
parsers) to 100%. That is, all terms must be there.
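
As a rough sketch of that first step, the request parameters might be
built like this. The field name "text", the choice of the edismax parser
and the helper name build_all_terms_query are my assumptions for
illustration, not something from the original question:

```python
from urllib.parse import urlencode


def build_all_terms_query(words, field="text"):
    """Return Solr request parameters that require every word to match.

    Hypothetical helper: the field name and parser choice are assumptions.
    """
    return {
        "defType": "edismax",  # assumption: edismax parser is in use
        "qf": field,           # assumption: the searched field is "text"
        "q": " ".join(words),  # one optional clause per word
        "mm": "100%",          # minimum-should-match: all clauses required
    }


params = build_all_terms_query(["word1", "word2", "word3"])
query_string = urlencode(params)  # ready to append to a /select URL
```

With mm at 100%, a document missing any one of the words drops out
before scoring is even considered.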

You can almost do the scoring portion by sorting on a function query
such as sum(termfreq(text, 'word1'), termfreq(text, 'word2')) and so on.
That would give you the number of times your query terms appear in the
field, but the issue is that nothing records the total number of terms
in a particular field.
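
With 500 words you would not want to type that function by hand. A
minimal sketch of generating it, assuming the field is called "text" and
that the words contain no quote characters (termfreq_sum is a
hypothetical helper name):

```python
def termfreq_sum(words, field="text"):
    """Build a Solr function query summing termfreq() over the given words.

    Assumes at least two words and no quote characters inside them.
    """
    parts = ["termfreq({},'{}')".format(field, w) for w in words]
    return "sum({})".format(",".join(parts))


# Value for the "sort" request parameter, highest count first.
sort_param = termfreq_sum(["word1", "word2"]) + " desc"
```

Note that termfreq looks up the indexed form of a term, so the words
would need to match whatever the field's analysis chain produces (for
example lowercased forms).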

Perhaps you could pre-tokenise the field before indexing it, and store
the number of terms in your index. Then, your score would be the sum of
the termfreq(text, '<yourterms>') values, divided by the total number of
terms in the document.
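
To make that concrete, here is a sketch of the pre-tokenising step and
the resulting ratio. The whitespace tokeniser, the term_count field name
and the helper names are all assumptions of mine; the client-side
tokenising would have to agree with the field's analysis in Solr for the
numbers to line up:

```python
def make_doc(doc_id, text):
    """Build a document dict carrying its own term count.

    Assumption: splitting on whitespace matches the field's tokenisation.
    """
    return {"id": doc_id, "text": text, "term_count": len(text.split())}


def known_fraction(text, words):
    """Fraction of the tokens in text that appear in the word list."""
    tokens = text.split()
    known = set(words)
    hits = sum(1 for t in tokens if t in known)
    return hits / len(tokens) if tokens else 0.0


def coverage_function(words, field="text", count_field="term_count"):
    """Solr function query: summed term frequencies over stored term count."""
    parts = ",".join("termfreq({},'{}')".format(field, w) for w in words)
    return "div(sum({}),{})".format(parts, count_field)


doc = make_doc("1", "word1 word2 word3 word4")
fraction = known_fraction(doc["text"], ["word1", "word2", "word3"])
```

For the example in the question, the document has four terms and three
of them are in the list, so the fraction is 0.75 and it would just meet
a 75% threshold. If a hard cut-off is wanted rather than a sort,
wrapping the function in a range filter such as {!frange l=0.8} may do
it, though I have not tried that here.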

That gets you almost there, but not quite the last leg.

I don't know whether it is possible to write a fieldlength(text)
function that returns the number of terms in the field.

Upayavira
