lucene-java-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com.INVALID>
Subject Re: Multi-field IDF
Date Thu, 17 Nov 2016 21:25:08 GMT
Hi Nicholas,

IDF is, among other things, a measure of term specificity. If 'or' is uncommon in titles, then
it has some discriminating power in that domain.

I think it's OK for 'or' to get a high IDF value in this case.

Ahmet



On Thursday, November 17, 2016 9:09 PM, Nicolás Lichtmaier <nicolasl@wolfram.com> wrote:
IDF measures the selectivity of a term, but it is calculated
per field. That can be bad for very short fields (like titles). One
example of this problem: if I don't remove stop words, then "or", "and",
etc. should end up with low IDF values. However, "or" is perhaps not
very common in titles, so it will get a high IDF value and be treated
as an important term. That's bad.
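
To make the effect concrete, here is a minimal sketch in plain Java (no Lucene dependency). It assumes the BM25-style IDF formula, log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)); the class name and all counts are hypothetical, chosen only to illustrate how the same term can score very differently per field.

```java
// Sketch only: PerFieldIdf and the counts below are hypothetical,
// not taken from any real index.
class PerFieldIdf {

    // BM25-style IDF (assumed formula):
    // log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    static double idf(long docFreq, long docCount) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        long docCount = 100_000; // hypothetical number of documents
        long dfTitle = 500;      // hypothetical: "or" appears in few titles
        long dfBody = 90_000;    // hypothetical: "or" appears in most bodies

        // The same term gets a much higher IDF in the short field.
        System.out.printf("idf(title:or) = %.3f%n", idf(dfTitle, docCount));
        System.out.printf("idf(body:or)  = %.3f%n", idf(dfBody, docCount));
    }
}
```

With these made-up counts the title-field IDF comes out far larger than the body-field IDF, which is the behaviour described above.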

One solution I see is to modify the Similarity to use a global, or
multi-field, IDF value. This value would include in its calculation
longer fields that have more "normal text"-like statistics. However, this is
not trivial, because I can't just add up the document frequencies (I would be
counting some documents several times if "or" is present in more than
one field). I would need to OR the bit vectors that signal the
presence of the term, right? Not trivial.
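
The double-counting concern can be sketched with java.util.BitSet standing in for per-field "documents containing the term" bit vectors. This is only an illustration: the class and method names are hypothetical, the document IDs are made up, and it does not use Lucene's own data structures or APIs.

```java
import java.util.BitSet;

// Sketch only: MultiFieldDocFreq is a hypothetical helper, not Lucene API.
class MultiFieldDocFreq {

    // Naive approach: add per-field document frequencies.
    // Documents containing the term in several fields are counted repeatedly.
    static int summedDocFreq(BitSet... perFieldDocs) {
        int sum = 0;
        for (BitSet docs : perFieldDocs) {
            sum += docs.cardinality();
        }
        return sum;
    }

    // OR the per-field bit vectors, then count set bits:
    // each document is counted once, however many fields match.
    static int unionDocFreq(BitSet... perFieldDocs) {
        BitSet union = new BitSet();
        for (BitSet docs : perFieldDocs) {
            union.or(docs);
        }
        return union.cardinality();
    }

    public static void main(String[] args) {
        BitSet title = new BitSet();
        title.set(1);
        title.set(3);          // made-up: term occurs in titles of docs 1 and 3

        BitSet body = new BitSet();
        body.set(0, 5);        // made-up: term occurs in bodies of docs 0..4

        System.out.println("summed df = " + summedDocFreq(title, body));
        System.out.println("union df  = " + unionDocFreq(title, body));
    }
}
```

In this toy case docs 1 and 3 match in both fields, so the summed count overstates the number of distinct documents, while the OR-based count does not.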

Has anyone encountered this issue? Has it been solved? Is my thinking wrong?

Should I also try the developers' list?

Thanks!

Nicolás.-

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org