lucene-java-user mailing list archives

From Nicolás Lichtmaier <nicol...@wolfram.com>
Subject Re: Multi-field IDF
Date Thu, 17 Nov 2016 22:20:50 GMT
That depends on what you want. In this case I want the discrimination
power to be based on all the body text, not just the titles. Otherwise,
terms that are really not that relevant end up with a very high IDF
value!
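
As a rough illustration of the difference (a sketch only, not Lucene code: the class and method names are hypothetical, the document sets are made up, and the formula 1 + ln((N+1)/(df+1)) is a classic-style IDF assumed for the example), the document frequency can be taken over the union of the per-field document sets, so that a document containing "or" in both title and body is counted once:

```java
import java.util.BitSet;

// Hypothetical sketch: multi-field document frequency via a union of
// per-field "term is present" bit sets. Not part of any Lucene API.
public class MultiFieldIdfSketch {

    // Number of distinct documents that contain the term in at least one field.
    static int unionDocFreq(BitSet... perFieldDocs) {
        BitSet union = new BitSet();
        for (BitSet docs : perFieldDocs) {
            union.or(docs); // OR avoids counting a document twice
        }
        return union.cardinality();
    }

    // Classic-style IDF, assumed here for illustration only.
    static double idf(int docFreq, int docCount) {
        return 1.0 + Math.log((docCount + 1.0) / (docFreq + 1.0));
    }

    // Helper: build a bit set from a list of document ids.
    static BitSet docs(int... ids) {
        BitSet set = new BitSet();
        for (int id : ids) {
            set.set(id);
        }
        return set;
    }

    public static void main(String[] args) {
        int docCount = 10;
        // Made-up data: "or" appears in one title but in most bodies.
        BitSet inTitle = docs(3);
        BitSet inBody = docs(0, 1, 2, 3, 4, 5, 6, 7, 8);

        int titleDf = inTitle.cardinality();
        int summedDf = inTitle.cardinality() + inBody.cardinality(); // double-counts doc 3
        int unionDf = unionDocFreq(inTitle, inBody);

        System.out.println("title-only df = " + titleDf + ", idf = " + idf(titleDf, docCount));
        System.out.println("summed df     = " + summedDf + " (counts doc 3 twice)");
        System.out.println("union df      = " + unionDf + ", idf = " + idf(unionDf, docCount));
    }
}
```

With data like this, the title-only IDF of "or" is high while the union-based IDF is close to the minimum, which is the behaviour I am after. Doing the same thing against a real index is the non-trivial part.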


On 17/11/16 at 18:25, Ahmet Arslan wrote:
> Hi Nicholas,
>
> IDF is, among other things, a measure of term specificity. If 'or' is not so usual in
> titles, then it has some discrimination power in that domain.
>
> I think it's OK for 'or' to get a high IDF value in this case.
>
> Ahmet
>
>
>
> On Thursday, November 17, 2016 9:09 PM, Nicolás Lichtmaier <nicolasl@wolfram.com> wrote:
> IDF measures the selectivity of a term. But the calculation is
> per-field. That can be bad for very short fields (like titles). One
> example of this problem: If I don't delete stop words, then "or", "and",
> etc. should be handled by their low IDF values. However, "or" is, perhaps,
> not so usual in titles. Then "or" will have a high IDF value and be
> treated as an important term. That's bad.
>
> One solution I see is to modify the Similarity to have a global, or
> multi-field, IDF value. This value would include in its calculation
> longer fields that have more "normal text"-like stats. However, this is
> not trivial because I can't just add document frequencies (I would be
> counting some documents several times if "or" is present in more than
> one field). I would need to OR the bit-vectors that signal the
> presence of the term, right? Not trivial.
>
> Has anyone encountered this issue? Has it been solved? Is my thinking wrong?
>
> Should I also try the developers' list?
>
> Thanks!
>
> Nicolás.-
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>



