lucene-java-user mailing list archives

From Joshua O'Madadhain
Subject Re: text format and scoring
Date Fri, 02 Aug 2002 23:14:53 GMT
On Sat, 3 Aug 2002, petite_abeille wrote:

> I was wondering what would be a good way to incorporate text format
> information in Lucene word/document scoring. For example, when turning
> HTML into plain text for indexing purposes, a lot of potentially useful
> information is lost: e.g. tags like <bold>, <strong> and so on could be
> understood as conveying emphasis information about some words. If
> somebody took the pain to "underline" some words, why throw it away?
> Assuming there is some interesting meaning in a document format/layout, 
> and a way to understand it and weight it, how could one incorporate this 
> information into document scoring?

If you can boost terms as they are indexed (I can't remember if this is
possible, but you can certainly do so on queries) then that might be a
good way of doing it; it's not so much a matter of changing document
scores (on the back end, with respect to a particular query) as it is of
changing the weighting of terms (on the front end).
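
Since boosting on queries is the part that is known to work, one possible
arrangement is to index the emphasized words a second time in their own
field and boost that field when building the query string. The sketch below
only assembles the query text using the query parser's field:term^boost
syntax. The field names "body" and "emphasis" and the class EmphasisQuery
are assumptions made up for illustration, and the term is assumed to be a
single plain word with no characters that need escaping.

```java
// Hypothetical helper: builds a query string that searches an ordinary
// "body" field and a separate "emphasis" field, boosting the latter.
// Both field names are illustrative assumptions, not Lucene conventions.
class EmphasisQuery {

    // term: a single plain word (no query-syntax characters).
    // emphasisBoost: how much more a hit in the emphasis field should count.
    static String build(String term, float emphasisBoost) {
        return "body:" + term + " emphasis:" + term + "^" + emphasisBoost;
    }
}
```

For example, EmphasisQuery.build("lucene", 2.0f) yields the string
body:lucene emphasis:lucene^2.0, which would then be handed to the query
parser.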

I've just glanced through the API and I don't see a way to do term
boosting during indexing, but maybe there's something I've missed.  
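
Failing a real index-time boost, a crude workaround is to repeat the
emphasized words while converting HTML to plain text, so that their term
frequency goes up before the text ever reaches the indexer. What follows is
a rough sketch of that preprocessing step only; it does not touch the Lucene
API. The class name EmphasisExpander, the list of tags, and the repeat count
are all assumptions, and the regular expression handles only simple,
non-nested tags without attributes (a real HTML parser would be more
robust).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical preprocessing helper, not part of Lucene.
class EmphasisExpander {

    // Simple, non-nested emphasis elements such as <b>...</b> or <strong>...</strong>.
    private static final Pattern EMPHASIS = Pattern.compile(
            "<(b|strong|em|i|u)>(.*?)</\\1>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Any remaining tag, removed after the emphasized text has been expanded.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    // Returns plain text in which emphasized passages occur `repeat` times.
    static String expand(String html, int repeat) {
        Matcher m = EMPHASIS.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String inner = m.group(2);
            StringBuilder repeated = new StringBuilder();
            for (int i = 0; i < repeat; i++) {
                if (i > 0) {
                    repeated.append(' ');
                }
                repeated.append(inner);
            }
            // quoteReplacement keeps '$' and '\' in the text from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(repeated.toString()));
        }
        m.appendTail(out);
        // Strip the remaining markup and normalize whitespace.
        String text = ANY_TAG.matcher(out).replaceAll(" ");
        return text.replaceAll("\\s+", " ").trim();
    }
}
```

Calling EmphasisExpander.expand("<p>Lucene is <b>fast</b></p>", 3) gives
"Lucene is fast fast fast". The obvious costs are that the repeated words
shift term positions, which may interfere with phrase queries, and that the
emphasis ends up expressed only through term frequency rather than as an
explicit weight.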


  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.
