lucene-dev mailing list archives

From "Olli Kuonanoja (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8501) An ability to define the sum method for custom term frequencies
Date Mon, 17 Sep 2018 12:58:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16617478#comment-16617478 ]

Olli Kuonanoja commented on LUCENE-8501:
----------------------------------------

Thanks for the pointer [~jpountz]. In my case I'd be a bit worried about the loss of precision
with the 16-bit encoding, though I can't say how much it would affect the results without
proper testing. Storage efficiency, however, has not been an issue for me in practice. One
more issue I forgot to point out in the original description: the value of _invertState.length_ becomes
useless for similarities, as it is always the sum of the integer representations. Using a fixed-point
encoding would be a workaround for that, but I'm still wondering whether it would make sense to
allow users to override the sum function for different use cases.
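
To illustrate the fixed-point workaround, here is a minimal standalone sketch. It does not
use any Lucene classes; the class name, the SCALE of 1000 and the encode/decode helpers are
assumptions made up for this example. The point is only that, with a fixed-point encoding,
the integer sum that ends up in _invertState.length_ is still the scaled sum of the real
frequencies.

```java
public class FixedPointFreq {
    // Hypothetical scale factor: keeps three decimal digits of precision.
    static final int SCALE = 1000;

    // Encode a positive float frequency as a fixed-point integer.
    static int encode(float freq) {
        if (!(freq > 0f)) {
            throw new IllegalArgumentException("frequency must be positive: " + freq);
        }
        // Clamp to 1 because a term frequency has to be at least 1.
        return Math.max(1, Math.round(freq * SCALE));
    }

    // Decode an encoded value (or a sum of encoded values) back to a float.
    static float decode(int encoded) {
        return (float) encoded / SCALE;
    }

    public static void main(String[] args) {
        float[] freqs = {0.5f, 1.25f, 2.0f};
        int length = 0;
        for (float f : freqs) {
            // Same exact integer addition as in the indexing chain.
            length = Math.addExact(length, encode(f));
        }
        System.out.println(length);         // 3750
        System.out.println(decode(length)); // 3.75
    }
}
```

The trade-off is range: with this scale the running sum overflows an int once the real
frequencies of a field add up to roughly 2.1 million.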

> An ability to define the sum method for custom term frequencies
> ---------------------------------------------------------------
>
>                 Key: LUCENE-8501
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8501
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Olli Kuonanoja
>            Priority: Major
>
> Custom term frequencies allow expert users to index and score in custom ways; however,
> _DefaultIndexingChain_ limits this, because the sum of frequencies must not overflow:
> {code:java}
> try {
>     invertState.length = Math.addExact(invertState.length, invertState.termFreqAttribute.getTermFrequency());
> } catch (ArithmeticException ae) {
>     throw new IllegalArgumentException("too many tokens for field \"" + field.name() + "\"");
> }
> {code}
> This might become an issue if, for example, the frequency data is encoded in a different
> way, say when a specific scorer works with float frequencies.
> A sum method could be added to _TermFrequencyAttribute_ to get something like
> {code:java}
> invertState.length = invertState.termFreqAttribute.addFrequency(invertState.length);
> {code}
> so users may define the summing method and avoid the overflow exceptions.
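
A rough standalone model of that proposal might look as follows. Everything here is
hypothetical: _FreqAttribute_ is a stand-in interface and not Lucene's
_TermFrequencyAttribute_, and _addFrequency_ is the method proposed in this issue, which
does not exist in Lucene. The sketch assumes frequencies are stored as raw float bits and
shows how an overridden sum avoids the overflow that the default exact addition runs into.

```java
public class SumHookSketch {
    /** Hypothetical stand-in for the attribute; not a Lucene class. */
    interface FreqAttribute {
        int getTermFrequency();

        // Default mirrors the current behaviour: exact integer addition.
        default int addFrequency(int runningLength) {
            return Math.addExact(runningLength, getTermFrequency());
        }
    }

    /** Stores a float frequency as raw IEEE 754 bits and sums in float space. */
    static final class FloatBitsFreq implements FreqAttribute {
        private final int bits;

        FloatBitsFreq(float freq) {
            this.bits = Float.floatToIntBits(freq);
        }

        @Override
        public int getTermFrequency() {
            return bits;
        }

        @Override
        public int addFrequency(int runningLength) {
            float sum = Float.intBitsToFloat(runningLength) + Float.intBitsToFloat(bits);
            return Float.floatToIntBits(sum);
        }
    }

    // Field length using the overridden sum; the bits of 0.0f are 0, so we start at 0.
    static float fieldLength(float... freqs) {
        int length = 0;
        for (float f : freqs) {
            length = new FloatBitsFreq(f).addFrequency(length);
        }
        return Float.intBitsToFloat(length);
    }

    // Field length using the default sum over the same raw bits; may throw.
    static int naiveLength(float... freqs) {
        int length = 0;
        for (float f : freqs) {
            FreqAttribute attr = () -> Float.floatToIntBits(f);
            length = attr.addFrequency(length);
        }
        return length;
    }

    public static void main(String[] args) {
        System.out.println(fieldLength(1.5f, 2.5f, 1.0f)); // 5.0
        try {
            naiveLength(1.5f, 2.5f, 1.0f);
        } catch (ArithmeticException e) {
            System.out.println("default sum overflowed: " + e.getMessage());
        }
    }
}
```

With the default sum, three small float frequencies are already enough to overflow,
because the raw bit patterns are large integers.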



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


