lucene-dev mailing list archives

From Doug Cutting <>
Subject Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?
Date Tue, 01 Feb 2005 19:05:09 GMT
Chuck Williams wrote:
>   > So I think this can be implemented using the expansion I proposed
>   > yesterday for MultiFieldQueryParser, plus something like my
>   > DensityPhraseQuery and perhaps a few Similarity tweaks.
> I don't think that works unless the mechanism is limited to default-AND
> (i.e., all clauses required).

Right.  I have repeatedly argued for default-AND.

> However, I don't see a way to integrate term proximity into that
> expansion.  Specifically, I don't see a way to handle proximity and
> coverage simultaneously without managing the multiple fields, field
> boosts and proximity considerations in a single query class.  Whence,
> the proposal for such a class.

To repeat my three-term, two-field example:

+(f1:t1^b1 f2:t1^b2)
+(f1:t2^b1 f2:t2^b2)
+(f1:t3^b1 f2:t3^b2)
f1:"t1 t2 t3"~s1^b3
f2:"t1 t2 t3"~s2^b4

Coverage is handled by the first three clauses.  Each term must match in 
at least one field.  Proximity is boosted by the last two clauses: when 
terms occur close together, the score is increased.  The implementation 
of the ~ operator could be improved, as I proposed.
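
As a rough illustration of the expansion above, here is a minimal Java sketch that builds the query string from a list of terms and per-field settings. The class DefaultAndExpansion, the FieldSpec holder and the expand method are hypothetical names made up for this example; they are not part of Lucene or of MultiFieldQueryParser, and the sketch only assembles text in the query syntax shown above.

```java
import java.util.Arrays;
import java.util.List;

public class DefaultAndExpansion {

    /** Hypothetical per-field settings: term boost, phrase slop, phrase boost. */
    static final class FieldSpec {
        final String name;
        final float termBoost;
        final int slop;
        final float phraseBoost;

        FieldSpec(String name, float termBoost, int slop, float phraseBoost) {
            this.name = name;
            this.termBoost = termBoost;
            this.slop = slop;
            this.phraseBoost = phraseBoost;
        }
    }

    /** Builds the expanded query text, one clause per line. */
    static String expand(List<String> terms, List<FieldSpec> fields) {
        StringBuilder q = new StringBuilder();

        // Coverage: one required clause per term, matching in any field.
        for (String term : terms) {
            q.append("+(");
            for (int i = 0; i < fields.size(); i++) {
                FieldSpec f = fields.get(i);
                if (i > 0) {
                    q.append(' ');
                }
                q.append(f.name).append(':').append(term)
                 .append('^').append(f.termBoost);
            }
            q.append(")\n");
        }

        // Proximity: one optional sloppy phrase clause per field.
        String phrase = String.join(" ", terms);
        for (FieldSpec f : fields) {
            q.append(f.name).append(":\"").append(phrase).append("\"~")
             .append(f.slop).append('^').append(f.phraseBoost).append('\n');
        }
        return q.toString().trim();
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("t1", "t2", "t3");
        List<FieldSpec> fields = Arrays.asList(
                new FieldSpec("f1", 2.0f, 3, 4.0f),
                new FieldSpec("f2", 1.0f, 5, 2.0f));
        System.out.println(expand(terms, fields));
    }
}
```

The number of clauses grows as (terms + 1) x fields, which is why the expansion stays manageable for typical short queries. The boost and slop values in main are arbitrary placeholders.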

> Do you see a way to do that?  I.e., do you see a scalable expansion that
> addresses all the issues for both default-or and default-and?

I am not really very interested in default-OR.  I think there are good 
reasons that folks have gravitated towards default-AND.  I would prefer 
we focus on a good default-AND solution for now.

If one wishes to rank things by coordination first, and then by score, 
as an improved default-OR, then one needs more than just score-based 
ranking.  Trying to concoct scores that alone guarantee such a ranking 
is very fragile.  In general, one would need a HitCollector API that 
takes both the coord and the score.  This is possible, but I'm not in a 
hurry to implement it.
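
To make the idea concrete, here is a minimal Java sketch of what such a collector might look like. The CoordHitCollector interface is hypothetical and does not exist in Lucene; the existing HitCollector only receives a document number and a score. The sketch also assumes that coord is passed as the count of matching clauses, which is an assumption of this example rather than something settled above.

```java
import java.util.ArrayList;
import java.util.List;

public class CoordFirstRanking {

    /** One collected hit: document number, matched-clause count, score. */
    static final class Hit {
        final int doc;
        final int coord;
        final float score;

        Hit(int doc, int coord, float score) {
            this.doc = doc;
            this.coord = coord;
            this.score = score;
        }
    }

    /** Hypothetical callback that receives coord in addition to score. */
    interface CoordHitCollector {
        void collect(int doc, int coord, float score);
    }

    /** Gathers hits and ranks them by coord first, then by score. */
    static final class RankingCollector implements CoordHitCollector {
        private final List<Hit> hits = new ArrayList<>();

        @Override
        public void collect(int doc, int coord, float score) {
            hits.add(new Hit(doc, coord, score));
        }

        List<Hit> ranked() {
            List<Hit> sorted = new ArrayList<>(hits);
            // Higher coord wins; score only breaks ties within equal coord.
            sorted.sort((a, b) -> a.coord != b.coord
                    ? Integer.compare(b.coord, a.coord)
                    : Float.compare(b.score, a.score));
            return sorted;
        }
    }

    public static void main(String[] args) {
        RankingCollector collector = new RankingCollector();
        collector.collect(1, 2, 9.0f);
        collector.collect(2, 3, 0.5f);
        collector.collect(3, 3, 1.5f);
        for (Hit h : collector.ranked()) {
            System.out.println(h.doc + " coord=" + h.coord + " score=" + h.score);
        }
    }
}
```

Because the ordering compares two separate values, no score needs to be inflated to guarantee that a document matching more clauses outranks one matching fewer.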

Lucene's development is constrained.  We want to improve Lucene, to 
make search results better, to make it faster, and add needed features, 
but we must at the same time keep it back-compatible, maintainable and 
easy-to-use.  The smaller the code, the easier it is to maintain and 
understand, so, e.g., a change that adds a lot of new code is harder to 
accept than one that just tweaks existing code a bit.  We are changing 
many APIs for Lucene 2.0, but we're also providing a clear migration 
path for Lucene 1.X users.  When we add a new, improved API we must 
deprecate the API it replaces and make sure that the new API supports 
all the features of the old API.  We cannot afford to maintain multiple 
implementations of similar functionality.  So, for these reasons, I am 
not comfortable simply committing your DistributingMultiFieldQueryParser 
and MaxDisjunctionQuery.  We need to fit these into Lucene, figure out 
what they replace, etc.  Otherwise Lucene could just become a 
hodge-podge of poorly maintained classes.  If we think these or 
something like them do a better job, then we'd like it to be natural for 
folks upgrading to start using them in favor of old methods, so that, 
long term, we don't have to maintain both.  So the problem is not simply 
figuring out what a better default ranking algorithm is, it is also 
figuring out how to successfully integrate such an algorithm into Lucene.

> I think
> the query class I've proposed does that, and should be no more complex
> than the current SpanQuery mechanism, for example.

The SpanQuery mechanism is quite complex and permits matching of a 
completely different sort: fragments rather than whole documents.  What 
you're proposing does not seem so radically different that it cannot be 
part of the normal document-matching mechanism.

> Also, it should be
> more efficient than a nested construction of more primitive components
> since it can be directly optimized.

It might use a bit less CPU, but would not reduce i/o.  My proposal 
processes TermDocs twice, but since Lucene processes query terms in 
parallel, and with filesystem caching, no extra i/o will be performed.

