tika-dev mailing list archives

From Jan Høydahl (JIRA) <j...@apache.org>
Subject [jira] Commented: (TIKA-568) Language Detection isReasonablyCertain() hides valuable information
Date Sun, 05 Dec 2010 15:24:11 GMT

    [ https://issues.apache.org/jira/browse/TIKA-568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966969#action_12966969 ]

Jan Høydahl commented on TIKA-568:

This will help in the short term by giving users better control over the threshold. But it will still
be an absolute "internal" value with its own problems: it is very hard to tune so that it works
well for general-purpose use.
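
For illustration only, here is a minimal sketch of what a caller-controlled threshold
could look like. It is not Tika's LanguageIdentifier and not the attached patch: the
class CertaintySketch, its constructor, and its accessors are hypothetical, and it
assumes the internal score is a distance where a smaller value means a closer match.
Only the 0.022 default is taken from the issue description.

```java
/**
 * Hypothetical stand-in for a language detection result.
 * Assumption: "distance" is an internal score where smaller means a closer match.
 */
final class CertaintySketch {
    // Default limit quoted in the issue description; its derivation is not stated there.
    static final double DEFAULT_THRESHOLD = 0.022;

    private final String language;
    private final double distance;

    CertaintySketch(String language, double distance) {
        this.language = language;
        this.distance = distance;
    }

    String getLanguage() {
        return language;
    }

    // Exposes the raw score so applications can apply their own policy.
    double getDistance() {
        return distance;
    }

    // Keeps the existing behaviour: compare against the built-in default.
    boolean isReasonablyCertain() {
        return isReasonablyCertain(DEFAULT_THRESHOLD);
    }

    // Lets the caller decide which threshold suits the application.
    boolean isReasonablyCertain(double threshold) {
        return distance < threshold;
    }

    public static void main(String[] args) {
        CertaintySketch result = new CertaintySketch("en", 0.03);
        System.out.println(result.isReasonablyCertain());     // default limit
        System.out.println(result.isReasonablyCertain(0.05)); // looser, caller-chosen limit
    }
}
```

Even with such an overload the caller is still comparing against an absolute internal
score, so the tuning problem described above remains.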

Long term, TIKA-369 and TIKA-496 should be solved, allowing us to compute a more reliable
measure of how certain a classification is.

> Language Detection isReasonablyCertain() hides valuable information
> -------------------------------------------------------------------
>                 Key: TIKA-568
>                 URL: https://issues.apache.org/jira/browse/TIKA-568
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Grant Ingersoll
>            Priority: Minor
>         Attachments: TIKA-568.patch
> LanguageIdentifier.isReasonablyCertain() hardcodes a threshold for language detection,
> which is fine, except applications should be allowed to decide what threshold suits them.
> For instance, how was 0.022 decided?

