tika-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] Created: (TIKA-369) Improve accuracy of language detection
Date Sun, 24 Jan 2010 18:52:17 GMT
Improve accuracy of language detection

                 Key: TIKA-369
                 URL: https://issues.apache.org/jira/browse/TIKA-369
             Project: Tika
          Issue Type: Improvement
          Components: languageidentifier
    Affects Versions: 0.6
            Reporter: Ken Krugler
            Assignee: Ken Krugler

Currently the LanguageProfile code uses 3-grams to find the best language profile using Pearson's
chi-square test. This has three issues:

1. The results aren't very good for short runs of text. Ted Dunning's paper (attached) indicates
that a log-likelihood ratio (LLR) test works much better, which would also make language detection
faster because less text would need to be processed. It might be sufficient to re-enable support
for 1..4-grams (similar to the original Nutch code) to improve quality.
2. The current LanguageIdentifier.isReasonablyCertain() method uses a fixed value as a threshold
for certainty. This is very sensitive to the amount of text being processed, and thus gives
false negative results for short runs of text.
3. Certainty should also be based on how much better the result is for language X, compared
to the next best language. If two languages both had identical sum-of-squares values, and
this value was below the threshold, then the result is still not very certain.
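
As a rough illustration of points 1 and 3, here is a minimal sketch, not Tika code. The class
LanguageCertaintySketch, its method names, and the maxBest and minRelativeGap parameters are all
hypothetical. It assumes each candidate language has a non-negative distance score where lower
is better. The LLR function uses the entropy form of Dunning's G^2 statistic for a 2x2 table of
counts (one n-gram vs. all other n-grams, in the text vs. in a profile).

```java
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Tika's API. */
final class LanguageCertaintySketch {

    /** x * ln(x), using the convention 0 * ln(0) = 0. */
    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized entropy of a set of counts: N ln N - sum(k ln k). */
    private static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            sum += c;
            parts += xLogX(c);
        }
        return xLogX(sum) - parts;
    }

    /**
     * Log-likelihood ratio (G^2) for a 2x2 contingency table.
     * k11: count of the n-gram in the text, k12: count of all other n-grams in the text,
     * k21: count of the n-gram in the profile, k22: count of all other n-grams in the profile.
     */
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matrixEntropy));
    }

    /**
     * Assumed certainty rule: the best distance must be under maxBest AND the
     * runner-up must be worse by at least minRelativeGap (a fraction of the
     * runner-up's distance). Two equal scores are never treated as certain.
     */
    static boolean isReasonablyCertain(Map<String, Double> distances,
                                       double maxBest, double minRelativeGap) {
        double best = Double.POSITIVE_INFINITY;
        double second = Double.POSITIVE_INFINITY;
        for (double d : distances.values()) {
            if (d < best) {
                second = best;
                best = d;
            } else if (d < second) {
                second = d;
            }
        }
        if (best > maxBest) {
            return false; // also covers an empty map
        }
        if (second == Double.POSITIVE_INFINITY) {
            return true;  // only one candidate language
        }
        if (second == 0.0) {
            return false; // best and runner-up are both zero: a tie
        }
        return (second - best) / second >= minRelativeGap;
    }
}
```

With this rule, a call such as isReasonablyCertain(scores, 0.022, 0.2) would reject a result where
the top two languages are nearly tied even when both are under the threshold. The threshold and
gap values here are placeholders; suitable values would have to be found by testing on real text.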

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
