lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From petite_abeille <>
Subject text format and scoring
Date Fri, 02 Aug 2002 22:44:52 GMT

I was wandering what would be a good way to incorporate text format 
information in Lucene word/document scoring. For example, when turning 
HTML into plain text for indexing purpose, a lot of potentially useful 
information are lost: eg tags like <bold>, <strong> and so on could be 
understood as conveying emphasis information about some words. If 
somebody took the pain to "underline" some words, why throw it away? 
Assuming there is some interesting meaning in a document format/layout, 
and a way to understand it and weight it, how could one incorporate this 
information into document scoring?

Thanks for any insights :-)


To unsubscribe, e-mail:   <>
For additional commands, e-mail: <>

View raw message