LanguageIdentifier API enhancements
-----------------------------------
Key: TIKA-465
URL: https://issues.apache.org/jira/browse/TIKA-465
Project: Tika
Issue Type: Improvement
Components: languageidentifier
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
In NUTCH-86, Jerome Charron identified a set of improvements for the
LanguageIdentifier that we should consider in Tika:
{quote}
More information can be found in the following thread on the Nutch-Dev mailing list:
http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00569.html
Summary:
1. LanguageIdentifier API changes. The similarity methods should return an ordered array of
language-code/score pairs instead of a simple String containing the language-code.
2. Ensure consistency between LanguageIdentifier scoring and NGramProfile.getSimilarity().
{quote}
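To make the two points above concrete, here is a minimal sketch of what such an API could look like. It is not the existing Tika or Nutch code: LanguageRanking, LanguageScore, Scorer, rank() and identify() are hypothetical names invented for illustration, and the sketch assumes that a higher score means a closer match. The ranking and the single-answer method both go through the same Scorer, which is one way to keep identifier scoring consistent with a per-profile similarity method.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LanguageRanking {

    /** Hypothetical language-code/score pair; not an existing Tika or Nutch type. */
    public static final class LanguageScore {
        public final String code;
        public final double score;

        public LanguageScore(String code, double score) {
            this.code = code;
            this.score = score;
        }

        @Override
        public String toString() {
            return code + "=" + score;
        }
    }

    /**
     * Hypothetical stand-in for a per-profile similarity function.
     * Assumption: a higher value means a closer match.
     */
    public interface Scorer {
        double similarity(String languageCode, String text);
    }

    /** Scores the text against every candidate language and returns the pairs, best match first. */
    public static LanguageScore[] rank(Iterable<String> codes, String text, Scorer scorer) {
        List<LanguageScore> results = new ArrayList<>();
        for (String code : codes) {
            results.add(new LanguageScore(code, scorer.similarity(code, text)));
        }
        // Descending by score; List.sort is stable, so ties keep the input order.
        results.sort(Comparator.comparingDouble((LanguageScore s) -> s.score).reversed());
        return results.toArray(new LanguageScore[0]);
    }

    /** Single-answer variant built on rank(), so both paths share one scoring function. */
    public static String identify(Iterable<String> codes, String text, Scorer scorer) {
        LanguageScore[] ranked = rank(codes, text, scorer);
        return ranked.length == 0 ? null : ranked[0].code;
    }

    public static void main(String[] args) {
        // Toy scores standing in for real n-gram profile similarities.
        Map<String, Double> toy = new LinkedHashMap<>();
        toy.put("de", 0.21);
        toy.put("en", 0.87);
        toy.put("fr", 0.35);
        Scorer scorer = (code, text) -> toy.get(code);

        for (LanguageScore s : rank(toy.keySet(), "some text", scorer)) {
            System.out.println(s);
        }
    }
}
```

Returning the full ordered array lets a caller apply its own threshold or inspect the runner-up, while identify() keeps the old single-String behaviour available on top of the same scores.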
I just wanted to capture the issue here in Tika before I close it out in Nutch,
since LanguageIdentification is something that can happen in Tika-ville...
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.