lucene-java-user mailing list archives

From Will Martin <wmartin...@gmail.com>
Subject Re: Multi-field IDF
Date Fri, 18 Nov 2016 13:13:35 GMT
In this work, we aim to improve the field weighting for structured
document retrieval. We first introduce the notion of field relevance as the
generalization of field weights, and discuss how it can be estimated using
relevant documents, which effectively implements relevance feedback for
field weighting. We then propose a framework for estimating field
relevance based on the combination of several sources. Evaluation on several
structured document collections shows that field weighting based on the
suggested framework improves retrieval effectiveness significantly.


https://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=1051




On 11/18/2016 3:57 AM, Ahmet Arslan wrote:
> Hi Nicholas,
>
> Aha, I see that you are into field-based scoring, which is an unsolved problem.
>
> Then, you might find BlendedTermQuery and SynonymQuery relevant.
>
> Ahmet
>
>
>
>
> On Friday, November 18, 2016 12:22 AM, Nicolás Lichtmaier <nicolasl@wolfram.com> wrote:
> That depends on what you want. In this case I want to use a
> discrimination power based on all the body text, not just the titles.
> Otherwise, terms that are really not that relevant end up with
> very high IDF values!
>
>
> On 17/11/16 at 18:25, Ahmet Arslan wrote:
>> Hi Nicholas,
>>
>> IDF is, among other things, a measure of term specificity. If 'or' is not so common in titles, then it has some discrimination power in that domain.
>>
>> I think it's OK for 'or' to get a high IDF value in this case.
>>
>> Ahmet
>>
>>
>>
>> On Thursday, November 17, 2016 9:09 PM, Nicolás Lichtmaier <nicolasl@wolfram.com> wrote:
>> IDF measures the selectivity of a term. But the calculation is
>> per-field. That can be bad for very short fields (like titles). One
>> example of this problem: If I don't delete stop words, then "or", "and",
>> etc. should get low IDF values. However, "or" is perhaps not
>> so common in titles. Then "or" will have a high IDF value and be treated
>> as an important term. That's bad.
>>
>> One solution I see is to modify the Similarity to have a global, or
>> multi-field, IDF value. This value would include in its calculation
>> longer fields that have more "normal text"-like stats. However, this is
>> not trivial because I can't just add document frequencies (I would be
>> counting some documents several times if "or" is present in more than
>> one field). I would need to OR the bit-vectors that signal the
>> presence of the term, right? Not trivial.
>>
>> Has anyone encountered this issue? Has it been solved? Is my thinking wrong?
>>
>> Should I also try the developers' list?
>>
>> Thanks!
>>
>> Nicolás.-
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
>
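
A note on the multi-field document frequency raised in the original question:
the idea of OR-ing per-field bit-vectors, instead of adding per-field document
frequencies, can be sketched in plain Java. This is a minimal illustration and
not Lucene code: the class MultiFieldIdf, its methods, and the toy document
sets are hypothetical, and java.util.BitSet merely stands in for whatever
per-field postings a real index would expose. The idf formula used is the
BM25-style form, assumed here only to make the comparison concrete.

```java
import java.util.BitSet;

public class MultiFieldIdf {

    /**
     * Number of documents that contain the term in at least one field.
     * Each BitSet marks the doc IDs where the term occurs in one field;
     * OR-ing them counts every document once, however many fields match.
     */
    public static int unionDocFreq(BitSet... perFieldDocs) {
        BitSet union = new BitSet();
        for (BitSet docs : perFieldDocs) {
            union.or(docs);
        }
        return union.cardinality();
    }

    /** BM25-style idf (assumed form): log(1 + (N - df + 0.5) / (df + 0.5)). */
    public static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        int docCount = 8; // toy index with doc IDs 0..7

        // Hypothetical postings for the term "or".
        BitSet inTitle = new BitSet(docCount);
        inTitle.set(3);                 // rare in titles
        BitSet inBody = new BitSet(docCount);
        inBody.set(0, 7);               // docs 0..6: common in bodies

        int titleDf = inTitle.cardinality();
        int naiveSum = inTitle.cardinality() + inBody.cardinality(); // counts doc 3 twice
        int unionDf = unionDocFreq(inTitle, inBody);

        System.out.println("title-only df = " + titleDf + ", idf = " + idf(titleDf, docCount));
        System.out.println("naive summed df = " + naiveSum + " (over-counts shared docs)");
        System.out.println("union df = " + unionDf + ", idf = " + idf(unionDf, docCount));
    }
}
```

With these toy sets, the title-only statistics make "or" look selective, while
the union across fields gives it a much lower idf, which is the behavior the
original question asks for. Computing such a union over a real index is the
costly part and is not addressed by this sketch.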

