lucene-solr-user mailing list archives

From Upayavira <...@odoko.co.uk>
Subject Re: Find documents that are composed of % words
Date Thu, 10 Oct 2013 13:19:18 GMT
Right - aside from the interesting intellectual exercise, the correct
question to ask is, "why?"

Why would you want to do this? What's the benefit, and is there a way of
doing it that is more in keeping with how Solr has been designed?

Upayavira

On Thu, Oct 10, 2013, at 01:17 PM, Erick Erickson wrote:
> Just to add my $0.02. Often this kind of thing is
> a mistaken assumption on the part of the client
> that they know how to score documents better
> than the really bright people who put a lot of time
> and energy into scoring (note, I'm _certainly_
> not one of those people!). I'll often, instead of
> making something like this work, see if I can
> tweak the scoring for a "good enough" solution.
> This can be a time-sink of the first magnitude for
> very little actual benefit.
> 
> Very often, if you get "good enough" results and
> put this kind of refinement on the back burner until
> the "more important" features are done, it never seems
> to percolate up to the point of needing work. And it's
> a disservice to clients to agree to implement
> something like this without at least discussing
> what you _won't_ be able to do if you do this.
> 
> Best,
> Erick
> 
> 
> 
> On Thu, Oct 10, 2013 at 7:51 AM, Upayavira <uv@odoko.co.uk> wrote:
> >
> >
> > On Wed, Oct 9, 2013, at 02:45 PM, shahzad73 wrote:
> >> my client has an unusual requirement: he will give a list of 500 words
> >> and then set a percentage, say 80%. He wants to find those pages or
> >> documents made up only of words from that list to the given percentage,
> >> with at most 20% unknown words.
> >> For example, take this document:
> >>
> >>              word1 word2 word3 word4
> >>
> >> If he gives the list "word1 word2 word3" and sets the accuracy to 75%,
> >> the document above meets the criteria: first, it matches all the listed
> >> words, and second, only 25% of its words are unknown, i.e. outside the
> >> search list.
> >>
> >> Here is another way to say it: "if 500 words are provided in the
> >> search, all 500 words must exist in the document, and unknown words
> >> may make up only 20% if the accuracy is 80%."
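The criterion described above can be sketched in plain Python (an illustration of the arithmetic only, not Solr code; tokenisation here is a naive whitespace split):

```python
def meets_criteria(document, query_words, accuracy):
    """Return True if every query word occurs in the document and the
    fraction of document tokens outside the query list stays within
    the allowed 1 - accuracy budget."""
    doc_tokens = document.lower().split()
    query = {w.lower() for w in query_words}
    # Part 1: all query words must appear in the document.
    if not query.issubset(doc_tokens):
        return False
    # Part 2: tokens not in the query list must not exceed the budget.
    unknown = sum(1 for t in doc_tokens if t not in query)
    return unknown / len(doc_tokens) <= 1 - accuracy

# The example from the message: query word1..word3, accuracy 75%,
# so up to 25% of the document's tokens may be unknown.
print(meets_criteria("word1 word2 word3 word4",
                     ["word1", "word2", "word3"], 0.75))  # True
```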
> >
> > As best as I can see, Solr can't quite do this, at least without
> > enhancement.
> >
> > There are two parts to how Solr works. The first is boolean querying,
> > in which a document either matches, or doesn't; this selects the set
> > of documents you are interested in.
> >
> > The second part is scoring, which involves calculating a score for all
> > of the documents that have got through the previous round.
> >
> > It seems the boolean portion could be achieved using
> > minimum-should-match=100%. That is, all terms must be there.
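With the eDisMax query parser, that boolean filter might look something like the request below (the collection name and the `text` field are placeholders; `mm` is the minimum-should-match parameter):

```shell
# All query terms must match (mm=100%). 'mycollection' and the
# 'text' field are hypothetical names for illustration.
curl "http://localhost:8983/solr/mycollection/select" \
  --data-urlencode "q=word1 word2 word3" \
  --data-urlencode "defType=edismax" \
  --data-urlencode "qf=text" \
  --data-urlencode "mm=100%"
```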
> >
> > You can almost do the scoring portion by sorting on function queries,
> > e.g. sorting on sum(termfreq(text, 'word1'), termfreq(text, 'word2'))
> > and so on. That would give you the number of times your query terms
> > appear in the field; the issue is that there's no way to retrieve the
> > total number of terms in a particular field.
> >
> > Perhaps you could pre-tokenise the field before indexing it, and store
> > the number of terms in your index. Then, your score would be the sum of
> > the termfreq(text, '<yourterms>') values, divided by the total number of
> > terms in the document.
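Outside Solr, the proposed ratio works out like this (a plain Python sketch of the arithmetic; `doc_tokens.count(t)` plays the role of termfreq, and the stored total term count is just the token-list length):

```python
def proposed_score(doc_tokens, query_terms):
    """Sum of per-term frequencies divided by the total number of
    tokens in the document -- the ratio described above."""
    matched = sum(doc_tokens.count(t) for t in query_terms)
    return matched / len(doc_tokens)

# The running example: 3 of the 4 tokens come from the query list.
tokens = "word1 word2 word3 word4".split()
print(proposed_score(tokens, ["word1", "word2", "word3"]))  # 0.75
```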
> >
> > That gets you almost there, but not quite all the way.
> >
> > I don't know whether it is possible to write a fieldlength(text)
> > function that returns the number of terms in the field.
> >
> > Upayavira
