lucene-dev mailing list archives

From "Olli Kuonanoja (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-8501) An ability to define the sum method for custom term frequencies
Date Tue, 18 Sep 2018 14:30:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619192#comment-16619192 ]

Olli Kuonanoja edited comment on LUCENE-8501 at 9/18/18 2:29 PM:
-----------------------------------------------------------------

{quote}

Do you know how many values per field you expect at most? For instance using 24 bits by shifting
the bits of the float representation right by 7 instead of 15 would retain more accuracy while
allowing for about 128 values per field per document. In general scoring doesn't focus on
accuracy: we are happy with recording lengths on a single byte, using Math.log(1+x) rather
than Math.log1p(x) or tweaking scoring formulas to add ones if it can help avoid dividing
by zero. Better accuracy doesn't improve ranking significantly.

{quote}

We need to support up to a few thousand values at the moment, so this becomes tricky. In theory
the frequencies could be represented very concisely by examining and sorting the values. In
practice, when a distributed system calculates them, finding such an ordering is infeasible.
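
For reference, a minimal standalone sketch of the bit-shifting encoding suggested in the quote
above; the class and method names are invented for illustration and nothing here is Lucene API:

{code:java}
// Illustration only: shifting the IEEE-754 bits of a float right discards
// low-order mantissa bits, trading accuracy for headroom when many encoded
// values are summed into a 32-bit field length.
public class FloatFreqEncodingSketch {

  // Encode a positive float frequency into roughly (32 - shift) bits.
  static int encode(float freq, int shift) {
    return Float.floatToIntBits(freq) >>> shift;
  }

  // Decode by shifting back; the dropped low-order bits are simply lost.
  static float decode(int encoded, int shift) {
    return Float.intBitsToFloat(encoded << shift);
  }

  public static void main(String[] args) {
    float freq = 3.14159f;
    for (int shift : new int[] {7, 15}) {
      int encoded = encode(freq, shift);
      // shift=7 keeps more mantissa bits per value (better accuracy, but fewer values
      // fit in an int sum); shift=15 keeps fewer (less accuracy, far more values fit).
      System.out.printf("shift=%d encoded=%d decoded=%f%n", shift, encoded, decode(encoded, shift));
    }
  }
}
{code}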

 

{quote}

It might... but such extension points have a significant impact on the API and testing. In
general we'd rather not add them unless there is a strong case to introduce them. Also there
are ramifications: if we change the way that the length is computed, then we also need to
change the way that frequencies are combined when a field has the same value twice, we also
need to worry about how to reflect it on index statistics like totalTermFreq and sumTotalTermFreq,
etc.

{quote}

Understood. Now that things like totalTermFreq and sumTotalTermFreq have been mentioned, the
whole "algebra" around them would probably need to be put behind an interface to implement this
properly. If that change is not in line with the direction of the project, I guess we can just
close this issue and live with the workarounds.
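
Purely as a hypothetical sketch of what such an interfaced "algebra" could look like (none of
these names exist in Lucene and this is not a concrete API proposal):

{code:java}
// Hypothetical only: a single place where the combination rules for lengths,
// same-term frequencies and index statistics would be defined.
public interface TermFrequencyAlgebra {

  /** Folds a newly seen term frequency into the running field length. */
  int addToLength(int currentLength, int termFrequency);

  /** Combines frequencies when a field contains the same term value twice. */
  int combineSameTerm(int freqA, int freqB);

  /** Folds a frequency into totalTermFreq / sumTotalTermFreq style statistics. */
  long addToTotalStats(long currentTotal, int termFrequency);
}
{code}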



> An ability to define the sum method for custom term frequencies
> ---------------------------------------------------------------
>
>                 Key: LUCENE-8501
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8501
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Olli Kuonanoja
>            Priority: Major
>
> Custom term frequencies allow expert users to index and score in custom ways; however,
> _DefaultIndexingChain_ limits this because the sum of the frequencies must not overflow:
> {code:java}
> try {
>     invertState.length = Math.addExact(invertState.length, invertState.termFreqAttribute.getTermFrequency());
> } catch (ArithmeticException ae) {
>     throw new IllegalArgumentException("too many tokens for field \"" + field.name() + "\"");
> }
> {code}
> This can become an issue if, for example, the frequency data is encoded differently,
> say when the specific scorer works with float frequencies.
> A sum method could be added to _TermFrequencyAttribute_ to get something like
> {code:java}
> invertState.length = invertState.termFreqAttribute.addFrequency(invertState.length);
> {code}
> so users can define the summing method themselves and avoid the overflow exceptions.
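
For illustration, one hedged sketch of how such a hook could look; the
SummableTermFrequencyAttribute name and the saturating default below are invented for this
example and are not part of Lucene:

{code:java}
import org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute;

// Hypothetical extension: addFrequency() is the method suggested in this issue,
// not an existing Lucene API.
public interface SummableTermFrequencyAttribute extends TermFrequencyAttribute {

  /**
   * Folds this token's frequency into the running field length, letting the
   * implementation pick the summing behaviour, e.g. saturating instead of
   * throwing on overflow as the default indexing chain does today.
   */
  default int addFrequency(int currentLength) {
    long sum = (long) currentLength + getTermFrequency();
    return sum > Integer.MAX_VALUE ? Integer.MAX_VALUE : (int) sum;
  }
}
{code}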



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

