lucene-dev mailing list archives

From Doug Cutting
Subject Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Date Tue, 01 Feb 2005 17:42:16 GMT
Chuck Williams wrote:
> Doug Cutting wrote:
>   > What did you think of my DensityPhraseQuery proposal?
> It is a step in the direction of what I have in mind, but I'd like to go
> further.  How about a query class with these properties:
>   1.  Inputs are:
>       a.  F = list of fields
>       b.  B = list of field boosts (1:1 correspondence with F)
>       c.  T = list of terms or phrases, each either optional or required
>       d.  P = proximity-sloping window
>   2.  Generate matches that contain every required T in some F, and if
> no required T's then at least one optional T in some F.
>   3.  Score matches based on these considerations:
>       a.  Normal TermQuery and PhraseQuery scores for individual matches
> in individual fields.
>       b.  Boost scores for proximity of TermQuery and PhraseQuery
> matches in individual fields, based on some function of P (term
> proximity).
>       c.  Boost scores based on number of optional T's matched in at
> least one F (term diversity).

That's a lot of functionality bundled into a single Query class!  I'd 
rather make it possible to assemble this from reusable parts.  And it 
almost can be already.  Then we can offer such a thing pre-packaged.

So let me take it point-by-point:

1a-c is the new MultiFieldQueryParser implementation.
1d is Similarity.sloppyFreq()
2 is BooleanQuery (except the weird optional stuff)
3a is TermQuery and PhraseQuery
3b is DensityPhraseQuery (to be implemented)
3c is Similarity.coord()
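
To make 1d and 3c concrete, here is a minimal sketch of the two Similarity hooks named above. It assumes the formulas used by Lucene's default similarity (1/(distance+1) for sloppyFreq, overlap/maxOverlap for coord). The class name SimilaritySketch is hypothetical and the code does not depend on Lucene itself.

```java
// Standalone illustration; SimilaritySketch is a hypothetical name, not a Lucene class.
final class SimilaritySketch {

    // Assumed default behavior: a sloppy phrase match at edit distance d
    // contributes 1/(d+1) to the phrase frequency, so closer matches count more.
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    // Assumed default behavior: the fraction of query clauses that matched,
    // which rewards documents matching more of the optional terms.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        System.out.println(sloppyFreq(0)); // exact phrase match: prints 1.0
        System.out.println(sloppyFreq(3)); // terms three positions apart: prints 0.25
        System.out.println(coord(2, 4));   // two of four clauses matched: prints 0.5
    }
}
```

A custom Similarity could override either function to change how strongly proximity (1d) or term diversity (3c) influences the score.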

So I think this can be implemented using the expansion I proposed 
yesterday for MultiFieldQueryParser, plus something like my 
DensityPhraseQuery and perhaps a few Similarity tweaks.
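
As a rough illustration of what assembling points 1a-c and 2 from existing parts might look like, the sketch below expands terms across boosted fields into a query string in Lucene's query syntax. The exact expansion proposed for MultiFieldQueryParser is not reproduced in this message, so the shape shown here (one parenthesized clause per term, prefixed with + when required) is an assumption, and the class and method names are hypothetical. It handles single-word terms only; phrases would need quoting.

```java
// Hypothetical helper; it only builds a query string and does not use Lucene.
final class MultiFieldExpansionSketch {

    // For each term, emit one clause that ORs the term over every field with
    // that field's boost. Required terms get a leading '+', so a match must
    // contain each required term in at least one field.
    static String expand(String[] fields, float[] boosts,
                         String[] terms, boolean[] required) {
        StringBuilder query = new StringBuilder();
        for (int t = 0; t < terms.length; t++) {
            if (t > 0) {
                query.append(' ');
            }
            if (required[t]) {
                query.append('+');
            }
            query.append('(');
            for (int f = 0; f < fields.length; f++) {
                if (f > 0) {
                    query.append(' ');
                }
                query.append(fields[f]).append(':').append(terms[t])
                     .append('^').append(boosts[f]);
            }
            query.append(')');
        }
        return query.toString();
    }

    public static void main(String[] args) {
        String q = expand(
            new String[] {"title", "body"},
            new float[] {2.0f, 1.0f},
            new String[] {"lucene", "scoring"},
            new boolean[] {true, false});
        // prints: +(title:lucene^2.0 body:lucene^1.0) (title:scoring^2.0 body:scoring^1.0)
        System.out.println(q);
    }
}
```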

>   > If field boosting needs to then trump idf, we should be able to deal
>   > with that when we subsequently tune field boosting, no?  We can, e.g.,
>   > square the field boosts if we need.
> Perhaps, but that seems to me to be a hack on top of a hack.  Current
> literature seems to consistently not square idf -- I found one reference
> that specifically says even Salton removed the squaring after he first
> proposed it a long time ago.  The simpler solution is just to remove the
> squaring.

I wasn't arguing that we shouldn't alter the idf definition.  Precisely 
the opposite in fact.  If squaring idf is bad, then that should show up 
in single-field search and we can adjust it in that context.  You had 
claimed that good idf formulation is confounded with multi-field search.
I do not believe that and that's what I was speaking to.  The Salton
work you cite is all single-field stuff.
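
The single-field effect of squaring can be seen with a small calculation. The sketch below assumes the idf form used by Lucene's default similarity, ln(numDocs / (docFreq + 1)) + 1, and compares how much a rare term outweighs a common one with and without squaring. The class name, the corpus size, and the document frequencies are made up for illustration.

```java
// Standalone arithmetic sketch; IdfSketch is a hypothetical name, not a Lucene class.
final class IdfSketch {

    // Assumed default idf: ln(numDocs / (docFreq + 1)) + 1.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Weight a term contributes in a single field, with or without squaring idf.
    static double termWeight(int docFreq, int numDocs, boolean squared) {
        double value = idf(docFreq, numDocs);
        return squared ? value * value : value;
    }

    // How many times more a rare term counts than a common one.
    static double rareToCommonRatio(int rareDocFreq, int commonDocFreq,
                                    int numDocs, boolean squared) {
        return termWeight(rareDocFreq, numDocs, squared)
             / termWeight(commonDocFreq, numDocs, squared);
    }

    public static void main(String[] args) {
        int numDocs = 1_000_000;   // illustrative corpus size
        int rare = 9;              // term found in 9 documents
        int common = 99_999;       // term found in about 10% of documents
        System.out.println("unsquared ratio: " + rareToCommonRatio(rare, common, numDocs, false));
        System.out.println("squared ratio:   " + rareToCommonRatio(rare, common, numDocs, true));
    }
}
```

Squaring turns the rare-to-common ratio into its own square, so any distortion from it would already be visible in single-field search, independent of field boosts.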

