lucene-java-user mailing list archives

From Doron Cohen <>
Subject Re: Lucene scoring question (how to boost leading terms match)
Date Tue, 03 Oct 2006 18:40:55 GMT
If I understand the question, you do not want to boost in advance a certain
doc, but rather score higher those documents containing the search term
closer to the start of the document.

There is more to define here - for instance, if doc1 has 5 words but doc2
has 1,000,000 words, would you still prefer doc1? There is a field norm
factor in Lucene that assigns higher scores to matches in shorter fields -
would you like to override this as well?
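
To make the field norm point concrete: in Lucene's default similarity the length norm is 1/sqrt(number of terms in the field). Below is a minimal sketch in plain Java with no Lucene dependency. The class and method names are illustrative only, and real Lucene also multiplies boosts into the norm and stores it lossily in a single byte, so treat the numbers as approximate.

```java
public class LengthNormSketch {
    // Same formula as Lucene's DefaultSimilarity.lengthNorm:
    // the fewer terms a field has, the larger its norm.
    public static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm(5));         // about 0.447
        System.out.println(lengthNorm(1_000_000)); // 0.001
    }
}
```

So with everything else equal, a match in the 5-word doc1 is weighted several hundred times higher than the same match in the 1,000,000-word doc2, before term position is considered at all.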

To your question, I can think of these possibilities:

  (1) Write your own query, with a scorer that scores based on term
position (possibly with some relation to field length). This is not
straightforward, and I'm not sure this is the solution you were hoping for.

  (2) Use SpanFirstQuery - something like: new SpanFirstQuery(new
SpanTermQuery(new Term("fieldName","word")),8) - as Hoss suggested. But I
think that here again you would need to modify the scorer to score earlier
matches higher, because as far as I can see the SpanScorer in use there
does not feed the match position into the score - i.e. both doc1 and doc2
in your example would get the same score, and the SpanFirstQuery would only
allow you to limit the set of returned documents - Hoss, do you agree with
this?

  (3) When adding the documents to the index, add a special <doc-start>
token to each document - for instance by prepending this special token to
the text of the indexed document's field. Then use a Lucene query that
scores higher those terms that are not "too" far away from it, for instance
a PhraseQuery with a slop factor greater than 0. That is, modify the query
to a phrase query:
    "<doc-start> ABC"
with a slop factor that suits your needs.
One problem I see with this is that all the documents in your index would
have this token.
Another problem is I don't think prefix queries (e.g. A*) are supported in
a phrase, and if so you would need to extend it a bit.
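
To illustrate why option (3) would rank doc1 above doc2, here is a rough simulation in plain Java, not Lucene code. It assumes the default sloppy phrase weighting of 1/(distance+1), where distance is the number of positions the matched term sits away from its exact place right after <doc-start>. The class name, method names and the hand-rolled prefix check are all hypothetical; a real score would also include idf, norms and so on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DocStartSlopSketch {
    static final String DOC_START = "<doc-start>";

    // Index-time step: prepend the sentinel token to the field's tokens.
    public static List<String> withDocStart(String text) {
        List<String> tokens = new ArrayList<>();
        tokens.add(DOC_START);
        tokens.addAll(Arrays.asList(text.trim().toLowerCase().split("\\s+")));
        return tokens;
    }

    // Query-time step: weight of the first token starting with `prefix`,
    // treated as the second term of the sloppy phrase "<doc-start> prefix*".
    // Returns 0 when no such token lies within `slop` positions.
    public static double leadingMatchWeight(List<String> tokens, String prefix, int slop) {
        for (int pos = 1; pos < tokens.size(); pos++) {
            if (tokens.get(pos).startsWith(prefix)) {
                int distance = pos - 1; // tokens between the sentinel and the match
                return distance <= slop ? 1.0 / (distance + 1) : 0.0;
            }
        }
        return 0.0;
    }

    public static void main(String[] args) {
        List<String> doc1 = withDocStart("Aterm Bterm Cterm");
        List<String> doc2 = withDocStart("Bterm Aterm Cterm");
        System.out.println(leadingMatchWeight(doc1, "a", 8)); // 1.0
        System.out.println(leadingMatchWeight(doc2, "a", 8)); // 0.5
    }
}
```

The slop value acts as the cutoff for how far into the document a match may be and still count, while the 1/(distance+1) factor provides the preference for earlier matches.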

Hope this helps,

qaz zaq <> wrote on 03/10/2006 09:50:24:

> Hi,
>   I have a question about the lucene scoring. In my following
> example, how can I ensure the doc1 has the higher score than doc2,
> if I search for "A*". In another words, I want to boost the docs
> which match their leading terms.
>   doc1: Aterm  Bterm  Cterm
>   doc2: Bterm  Aterm  Cterm
