[ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573753#comment-13573753 ]
Robert Muir commented on TIKA-369:
----------------------------------
The DetectorFactory is definitely gnarly, but you can load the JSON of the profiles
yourself from resource files (e.g. in the JAR) and pass it to loadProfile(List<String> json_profiles).
This is how Solr worked around the problem of bundling the profiles in the JAR.
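
For illustration, here is a rough sketch of the resource-loading half of that approach, using only the JDK. The class name ProfileResources, the helper names, and the "/profiles" directory layout are assumptions made for this example, not anything taken from Solr or the language-detection library. The resulting list is what you would hand to loadProfile(List<String> json_profiles).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reads language profile JSON from classpath resources.
class ProfileResources {

    /** Reads an entire stream as UTF-8 text. */
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    /**
     * Loads one JSON string per language code from resources named
     * dir + "/" + code (for example "/profiles/en"). The directory layout
     * is an assumption of this sketch.
     */
    static List<String> loadJsonProfiles(Class<?> anchor, String dir,
                                         List<String> languages) throws IOException {
        List<String> json = new ArrayList<>();
        for (String lang : languages) {
            String path = dir + "/" + lang;
            try (InputStream in = anchor.getResourceAsStream(path)) {
                if (in == null) {
                    throw new IOException("Missing profile resource: " + path);
                }
                json.add(readAll(in));
            }
        }
        // Pass the returned list to loadProfile(List<String> json_profiles).
        return json;
    }
}
```

Failing fast on a missing resource is a design choice here: a silently skipped profile would only show up later as worse detection results.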
> Improve accuracy of language detection
> --------------------------------------
>
> Key: TIKA-369
> URL: https://issues.apache.org/jira/browse/TIKA-369
> Project: Tika
> Issue Type: Improvement
> Components: languageidentifier
> Affects Versions: 0.6
> Reporter: Ken Krugler
> Assignee: Ken Krugler
> Attachments: lingdet-mccs.pdf, Surprise and Coincidence.pdf, textcat.pdf
>
>
> Currently the LanguageProfile code uses 3-grams to find the best language profile using
> Pearson's chi-square test. This has three issues:
> 1. The results aren't very good for short runs of text. Ted Dunning's paper (attached)
> indicates that a log-likelihood ratio (LLR) test works much better, which would then make
> language detection faster due to less text needing to be processed.
> 2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact value as
> a threshold for certainty. This is very sensitive to the amount of text being processed, and
> thus gives false negative results for short runs of text.
> 3. Certainty should also be based on how much better the result is for language X, compared
> to the next best language. If two languages both had identical sum-of-squares values, and
> this value was below the threshold, then the result is still not very certain.
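
To make issues 2 and 3 concrete, here is a minimal sketch of a certainty check that looks at the gap to the runner-up as well as at an absolute threshold. It is not Tika's implementation: the class MarginCertainty, the method signature, and both parameters (maxDistance, minRelativeGap) are hypothetical, and it assumes each language has a non-negative distance score where lower means a closer match.

```java
import java.util.Map;

// Hypothetical sketch, not part of Tika.
class MarginCertainty {

    /**
     * Returns true only if the best (lowest) distance is within maxDistance
     * AND the runner-up is worse by at least minRelativeGap, measured as
     * (second - best) / second. Distances are assumed to be >= 0.
     */
    static boolean isReasonablyCertain(Map<String, Double> distances,
                                       double maxDistance,
                                       double minRelativeGap) {
        double best = Double.POSITIVE_INFINITY;
        double second = Double.POSITIVE_INFINITY;
        for (double d : distances.values()) {
            if (d < best) {
                second = best;
                best = d;
            } else if (d < second) {
                second = d;
            }
        }
        if (best > maxDistance) {
            return false; // no candidate is close enough (also covers an empty map)
        }
        if (second == Double.POSITIVE_INFINITY) {
            return true; // only one candidate, nothing to compare against
        }
        if (second == 0.0) {
            return false; // best and runner-up are both zero: an exact tie
        }
        return (second - best) / second >= minRelativeGap;
    }
}
```

With this shape, two languages that tie below the threshold are reported as uncertain, which is the case item 3 describes. Choosing sensible values for the two parameters, and scaling them with text length as item 2 suggests, is left open here.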
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira