lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] [Commented] (LUCENE-4198) Allow codecs to index term impacts
Date Thu, 04 Jan 2018 06:22:00 GMT


Robert Muir commented on LUCENE-4198:

The similarity API doesn't make it easy to integrate: it currently gives a score(docID, freq)
API while we'd rather need a score(freq, norm) API, especially because this optimization only
works if freq and norm are the only per-document parameters that can influence the score.
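
To illustrate why that restriction matters, here is a rough sketch, not the patch attached to this issue. If the score is a function of freq and norm alone, a per-block upper bound can be computed from the (freq, norm) pairs stored with the postings, without looking at any document. The names Impact, FreqNormScore and ImpactSketch are hypothetical, and the scoring lambda is a made-up saturation formula.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical (freq, norm) pair recorded for a block of postings. */
final class Impact {
    final int freq;
    final long norm;

    Impact(int freq, long norm) {
        this.freq = freq;
        this.norm = norm;
    }
}

/** Hypothetical scoring function that depends on freq and norm only. */
interface FreqNormScore {
    float score(float freq, long norm);
}

final class ImpactSketch {
    /**
     * Upper bound on the score of any document in the block.
     * Starts at 0, so it assumes scores are never negative.
     */
    static float maxScore(List<Impact> impacts, FreqNormScore scorer) {
        float max = 0f;
        for (Impact impact : impacts) {
            max = Math.max(max, scorer.score(impact.freq, impact.norm));
        }
        return max;
    }

    public static void main(String[] args) {
        // Made-up formula: saturating in freq, penalizing larger norms.
        FreqNormScore scorer = (freq, norm) -> freq / (freq + norm / 10f);
        List<Impact> block = Arrays.asList(
                new Impact(1, 5), new Impact(4, 40), new Impact(2, 8));
        System.out.println(maxScore(block, scorer));
    }
}
```

If the score also depended on some other per-document value, the stored pairs would no longer be enough to bound it, which is the limitation described above.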

Well, I think it is fair game to simplify the API so it's not strange; I mean, we need to fix
it so you can make changes like this :) A lot of the stuff in Similarity was geared at just
hiding away the classic tf/idf stuff so that other things can work. But it should be the term
weighting API and limited to that, and there are only 3 components of that: term specificity,
term frequency, doc length.

Simple example: boosting doesn't need to be in this API; it's only there because it was needed
for crazy queryNorm before. But it never belonged, and it just adds complexity that isn't needed
(and bugs if you forget to multiply it in).

But along the path of this change, I think it's best to change the API to score(freq, norm).
But I don't think we should use a Long/boxing; we could just call score(freq, 1) for the omitNorms
case and that's it (similar to how we pass freq=1 when frequencies are omitted). Seems like
it would simplify things there. This is already what SimilarityBase is doing internally, and
it doesn't much matter what you substitute in there.
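
A minimal sketch of what that could look like, under stated assumptions: TermScorer, Bm25LikeScorer and ScoreSketch are hypothetical names, not Lucene's actual classes, and the norm is treated directly as a document length rather than as an encoded value. The omitNorms case passes a primitive 1 instead of a boxed, nullable Long.

```java
/** Hypothetical term-weighting interface; not Lucene's actual Similarity API. */
interface TermScorer {
    float score(float freq, long norm);
}

/** BM25-style scorer. Assumption: norm is used directly as the document length. */
final class Bm25LikeScorer implements TermScorer {
    private final float idf;          // term specificity
    private final float k1;           // term-frequency saturation
    private final float b;            // strength of length normalization
    private final float avgDocLength; // average document length in the collection

    Bm25LikeScorer(float idf, float k1, float b, float avgDocLength) {
        this.idf = idf;
        this.k1 = k1;
        this.b = b;
        this.avgDocLength = avgDocLength;
    }

    @Override
    public float score(float freq, long norm) {
        float lengthFactor = k1 * (1 - b + b * norm / avgDocLength);
        return idf * freq / (freq + lengthFactor);
    }
}

final class ScoreSketch {
    /** Constant substituted when norms are omitted; no boxing required. */
    static final long NO_NORM = 1L;

    static float scoreDoc(TermScorer scorer, float freq, boolean hasNorms, long norm) {
        return scorer.score(freq, hasNorms ? norm : NO_NORM);
    }

    public static void main(String[] args) {
        TermScorer scorer = new Bm25LikeScorer(2.0f, 1.2f, 0.75f, 10f);
        System.out.println(scoreDoc(scorer, 3f, true, 50L));  // norms indexed
        System.out.println(scoreDoc(scorer, 3f, false, 50L)); // omitNorms: norm is ignored
    }
}
```

Boosting does not appear anywhere in this sketch; a caller that needs it would multiply it in outside the term-weighting interface.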

> Allow codecs to index term impacts
> ----------------------------------
>                 Key: LUCENE-4198
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Sub-task
>          Components: core/index
>            Reporter: Robert Muir
>         Attachments: LUCENE-4198.patch, LUCENE-4198_flush.patch
> Subtask of LUCENE-4100.
> That's an example of something similar to impact indexing (though his implementation
> currently stores a max for the entire term, the problem is the same).
> We can imagine other similar algorithms too: I think the codec API should be able to
> support these.
> Currently it really doesn't: Stefan worked around the problem by providing a tool to 'rewrite'
> your index; he passes the IndexReader and Similarity to it. But it would be better if we fixed
> the codec API.
> One problem is that the Postings writer needs to have access to the Similarity. Another
> problem is that it needs access to the term and collection statistics up front, rather than
> after the fact.
> This might have some cost (hopefully minimal), so I'm thinking to experiment in a branch
> with these changes and see if we can make it work well.
