lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From petite_abeille <>
Subject Re: text format and scoring
Date Sat, 03 Aug 2002 21:46:50 GMT
Hi Alex,

On Saturday, August 3, 2002, at 11:13 , Alex Murzaku wrote:

> Hi PA! How are things going?

Doing all right :-)

> It's an interesting question but I don't think Lucene
> (as it is today) could change weights based on
> semantics (either assigned by formatting tags or maybe
> looked up in some dictionary like WordNet)...

Ummm... I see.

> Some time ago, Doug sent to this list the formula for
> the score computation which is:


> The only thing that counts is the frequency of the
> terms in the document and among documents.
> A way to influence the final score might be to tweak
> the real frequencies during indexing with some
> parameters configured externally. Let's say if the
> word is underlined then multiply its count by X. This
> modified TF should influence the final score
> accordingly.
> Just a thought...

I see. That's what I'm basically doing right now somehow: I index a 
document multiple time (eg an email could be indexed by subject, first 
sentence and body content). Then I do multiple searches. And use a 
"ranking comparator" to evaluate the result based on how many time I get 
a specific document plus its Lucene scores and other funky heuristics. 
Which seems to work ok, but is kind of cumbersome :-( Same deal for 
finding "related" document. Lucene is very good for finding "similar" 
document, but for "related" (think "cluster" ;-), I basically end up 
doing some term categorization and assign some multiplying factor for 
each term category. Which then I feed to Lucene to get something more 
akin to a "cluster" of document...

In any case, I was simply wandering if there was a more straightforward 
way of doing things.



To unsubscribe, e-mail:   <>
For additional commands, e-mail: <>

View raw message