lucene-java-user mailing list archives

From Nicolás Lichtmaier <nicol...@wolfram.com>
Subject Multi-field IDF
Date Thu, 17 Nov 2016 18:09:04 GMT
IDF measures the selectivity of a term, but it is calculated per field.
That can be bad for very short fields (like titles). One example of this
problem: if I don't remove stop words, then "or", "and", etc. should end
up with low IDF values. However, "or" is perhaps not so common in titles,
so in that field it will get a high IDF value and be treated as an
important term. That's bad.
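
To make the effect concrete, here is a minimal sketch of the arithmetic. It assumes a classic-style IDF of the form 1 + ln((docCount + 1) / (docFreq + 1)); the exact formula depends on the Similarity in use, and the class name, method name, and all counts below are hypothetical, chosen only for illustration.

```java
public class IdfSketch {

    // Assumed classic-style IDF: 1 + ln((docCount + 1) / (docFreq + 1)).
    // The real formula depends on the configured Similarity.
    static double idf(long docFreq, long docCount) {
        return 1.0 + Math.log((docCount + 1) / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        long docCount = 1000;      // hypothetical index size
        long dfOrInBody = 900;     // hypothetical: "or" occurs in most bodies
        long dfOrInTitle = 5;      // hypothetical: "or" is rare in titles

        // The same stop word gets very different weights per field.
        System.out.printf("body  idf(or) = %.3f%n", idf(dfOrInBody, docCount));
        System.out.printf("title idf(or) = %.3f%n", idf(dfOrInTitle, docCount));
    }
}
```

With numbers like these, the title-field IDF of "or" comes out several times larger than the body-field IDF, even though the term is no more meaningful in a title.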

One solution I see is to modify the Similarity to use a global, or
multi-field, IDF value. This value would include in its calculation
longer fields that have more "normal text"-like statistics. However, this
is not trivial because I can't just add the per-field document
frequencies (I would be counting some documents several times if "or" is
present in more than one field). I would need to OR the bit vectors that
signal the presence of the term in each field, right? Not trivial.
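
The double-counting problem can be sketched with plain java.util.BitSet, independent of any Lucene API. This is only an illustration of the idea: each bit set is assumed to mark the documents that contain the term in one field, and the class, method names, and document ids are hypothetical.

```java
import java.util.BitSet;

public class MultiFieldDocFreq {

    // Number of distinct documents containing the term in at least one field.
    static int unionDocFreq(BitSet... perFieldDocs) {
        BitSet union = new BitSet();
        for (BitSet docs : perFieldDocs) {
            union.or(docs); // OR the per-field presence bits together
        }
        return union.cardinality();
    }

    // Naive sum of per-field document frequencies; counts a document once
    // per field in which the term appears, so it can overcount.
    static int summedDocFreq(BitSet... perFieldDocs) {
        int sum = 0;
        for (BitSet docs : perFieldDocs) {
            sum += docs.cardinality();
        }
        return sum;
    }

    // Helper to build a bit set from document ids (hypothetical data).
    static BitSet docs(int... ids) {
        BitSet bits = new BitSet();
        for (int id : ids) {
            bits.set(id);
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet title = docs(0, 2);        // "or" appears in these titles
        BitSet body = docs(0, 1, 2, 3);   // "or" appears in these bodies

        System.out.println("summed df = " + summedDocFreq(title, body));
        System.out.println("union df  = " + unionDocFreq(title, body));
    }
}
```

The union count is what a multi-field IDF would need as its document frequency; the cost is that the per-field postings for the term have to be walked to build it, which is where the "not trivial" part comes from.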

Has anyone encountered this issue? Has it been solved? Is my thinking wrong?

Should I also try the developers' list?

Thanks!

Nicolás.-


