[ https://issues.apache.org/jira/browse/TIKA-369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573541#comment-13573541 ]

Ted Dunning commented on TIKA-369:
----------------------------------

It is hard to object, but it would be good to replicate the accuracy numbers on the kind of text that Tika typically sees.

> Improve accuracy of language detection
> --------------------------------------
>
>                 Key: TIKA-369
>                 URL: https://issues.apache.org/jira/browse/TIKA-369
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 0.6
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>         Attachments: lingdet-mccs.pdf, Surprise and Coincidence.pdf, textcat.pdf
>
>
> Currently the LanguageProfile code uses 3-grams to find the best language profile using Pearson's chi-square test. This has three issues:
> 1. The results aren't very good for short runs of text. Ted Dunning's paper (attached) indicates that a log-likelihood ratio (LLR) test works much better, which would then make language detection faster due to less text needing to be processed.
> 2. The current LanguageIdentifier.isReasonablyCertain() method uses an exact value as a threshold for certainty. This is very sensitive to the amount of text being processed, and thus gives false negative results for short runs of text.
> 3. Certainty should also be based on how much better the result is for language X, compared to the next best language. If two languages both had identical sum-of-squares values, and this value was below the threshold, then the result is still not very certain.
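
For illustration only, here is a rough sketch of how points 1 and 3 of the issue description might fit together: an LLR (G^2) score over 3-gram counts in place of Pearson's chi-square, plus a certainty check based on the margin between the best and second-best language rather than a fixed threshold. This is not Tika code. It is written in Python for brevity, and every name in it (`trigrams`, `llr_2x2`, `llr_distance`, `detect`, `min_margin`) is hypothetical, as is the default margin of 0.1. Summing per-3-gram LLR values into a single distance is just one plausible way to apply the test, not necessarily what the attached paper or any eventual patch does.

```python
import math
from collections import Counter


def trigrams(text):
    """Count overlapping character 3-grams in lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def llr_2x2(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) statistic for a 2x2 contingency table."""
    total = k11 + k12 + k21 + k22
    if total == 0:
        return 0.0
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    g2 = 0.0
    for k, r, c in cells:
        if k > 0:  # empty cells contribute nothing (0 * log 0 is taken as 0)
            expected = rows[r] * cols[c] / total
            g2 += k * math.log(k / expected)
    return 2.0 * g2


def llr_distance(sample, profile):
    """Sum of per-3-gram LLR values; smaller means the sample fits better.

    Assumption: each 3-gram gets its own table of
    (count in sample, count in profile, other 3-grams in sample, other in profile).
    """
    n_s = sum(sample.values())
    n_p = sum(profile.values())
    grams = set(sample) | set(profile)
    return math.fsum(
        llr_2x2(sample[g], profile[g], n_s - sample[g], n_p - profile[g])
        for g in grams
    )


def detect(text, profiles, min_margin=0.1):
    """Return (best_language, is_certain).

    Certainty is relative: the best distance must beat the runner-up by at
    least min_margin (a fraction of the runner-up's distance), so two
    near-identical scores are never reported as certain.
    """
    sample = trigrams(text)
    ranked = sorted((llr_distance(sample, p), lang) for lang, p in profiles.items())
    best_d, best_lang = ranked[0]
    if len(ranked) < 2:
        return best_lang, False  # nothing to compare against
    runner_d = ranked[1][0]
    margin = (runner_d - best_d) / runner_d if runner_d > 0 else 0.0
    return best_lang, margin >= min_margin
```

Because the certainty check is a ratio of two distances computed on the same text, it does not depend directly on how much text was processed, which is the sensitivity described in point 2. Whether that holds up on the kind of text Tika typically sees would still need to be measured.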