lucene-java-user mailing list archives

From Christoph Kiehl ...@sulu3000.de>
Subject Re: Use a date field for ranking
Date Sat, 08 Jan 2005 02:20:09 GMT
Chris Hostetter wrote:

> : we are currently implementing a search engine for a news site. Our goal
> : is to have a search result that uses the publish date of the documents
> : to boost the score of the documents.
> 
> : have to use something that boosts the scores at _search_ time.
>
> 1) There is a way to boost individual Query objects (which you may then
> compose into a Tree of BooleanQueries) see Query.setBoost(float)

Yes, I know I can boost Query objects, but that is not the same as
multiplying the document score by a factor. By boosting query objects I
_add_ values to the score. Let me show you an example:

I may use queries like this:

Query 1:
(a word that gets a score of 0.1) OR (date:20050108^3 OR date:20050107^1)

Query 2:
(a word that gets a score of 0.01) OR (date:20050108^3 OR date:20050107^1)

Assume the date part of the clause contributes a constant score of 0.3.
The total score of a matching document will then be:

Query 1: 0.4
Query 2: 0.31

If I had instead used a boost factor of 3.0 per document and left the
date part out of the query, I would have:

Query 1: 0.3
Query 2: 0.03

This maintains the original proportion between the two scores (10:1),
whereas the additive version compresses it to roughly 1.3:1. Now if I
want to specify a function (like 1/x) that calculates the boost factor
from a specific publish date, I can't emulate this with Query boosts,
because the query boost would have to be adjusted to the score of the
first part of the query to achieve an equal distribution for any query.
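
The arithmetic above can be sketched in plain Java. This is only an illustration of the two ways of combining scores, not Lucene code; the class and method names are made up, and the 0.3 date score and 3.0 boost factor are the example values from this mail.

```java
public class BoostArithmetic {

    // Additive combination: what OR-ing a boosted date clause onto the
    // query does. The date clause contributes its own score to the sum.
    public static double additive(double textScore, double dateScore) {
        return textScore + dateScore;
    }

    // Multiplicative combination: a document-level boost factor that
    // scales whatever score the text part produced.
    public static double multiplicative(double textScore, double boostFactor) {
        return textScore * boostFactor;
    }

    public static void main(String[] args) {
        double[] textScores = {0.1, 0.01}; // "Query 1" and "Query 2"
        for (double s : textScores) {
            System.out.println("text=" + s
                    + " additive=" + additive(s, 0.3)
                    + " multiplicative=" + multiplicative(s, 3.0));
        }
        // Ratio between the two results under each scheme.
        System.out.println("additive ratio: "
                + additive(0.1, 0.3) / additive(0.01, 0.3));
        System.out.println("multiplicative ratio: "
                + multiplicative(0.1, 3.0) / multiplicative(0.01, 3.0));
    }
}
```

Only the multiplicative variant keeps the 10:1 ratio of the underlying text scores. The additive variant lets the date contribution dominate whenever the text score is small.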

I'm sure there is a mathematical term which describes exactly this 
problem - but I'm no mathematician ;) So I hope you understand my issues.

Additionally, the construct above also finds documents that have the
right date but don't contain the first part of the query. So we might
use a query like this:

(a word) AND (date:20050108^3 OR date:20050107^1)

But now I have to specify _all_ possible dates in the date part to reach
all documents the index contains. This smells ;) because it is only an
emulation of the real strategy.


> 2) if you are planning to rebuild your index on a regular basis (ie:
> nightly) then you can easily apply boosts to your documents when you index
> them.

Unfortunately this is not an option because the index is updated incrementally.

> 3) I'm sure there is a very cool and efficient way to do this using a
> custom Similarity implementation (which somehow causes the default score
> to be divided by the age of the document) but I've never actually played
> with the Similarity class, so I won't say for certain it can be done that
> way (hopefully someone else can chime in)

AFAIK, Similarity can only be used at the term level. But as outlined
above, I need a boost factor at the document level.
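
To make "a boost factor at the document level" concrete, here is a minimal sketch in plain Java of the behaviour I am after: every hit's score is multiplied by a factor computed from its age, and the hits are then re-sorted. It deliberately uses no Lucene API. The `Hit` class, the `ageInDays` field and the particular function 1/(1 + age) are assumptions for illustration only (one possible "1/x"-style function).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class AgeRescorer {

    // Hypothetical stand-in for a search hit: id, raw score, document age.
    public static final class Hit {
        public final String id;
        public final double score;
        public final int ageInDays;

        public Hit(String id, double score, int ageInDays) {
            this.id = id;
            this.score = score;
            this.ageInDays = ageInDays;
        }
    }

    // Assumed decay function: 1.0 for today's documents, smaller for older
    // ones. The +1 avoids division by zero for an age of 0 days.
    public static double ageBoost(int ageInDays) {
        return 1.0 / (1 + ageInDays);
    }

    // Multiply each raw score by its age boost, then sort by the new
    // score in descending order. The input list is left unchanged.
    public static List<Hit> rescore(List<Hit> hits) {
        List<Hit> out = new ArrayList<>();
        for (Hit h : hits) {
            out.add(new Hit(h.id, h.score * ageBoost(h.ageInDays), h.ageInDays));
        }
        out.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return out;
    }
}
```

Because the factor multiplies the raw score, the proportion between two documents of the same age stays the same for any query, which is exactly what the additive query boosts cannot give me.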

Thanks for your input,
Christoph

