tika-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] Created: (TIKA-465) LanguageIdentifier API enhancements
Date Wed, 14 Jul 2010 17:40:51 GMT
LanguageIdentifier API enhancements
-----------------------------------

                 Key: TIKA-465
                 URL: https://issues.apache.org/jira/browse/TIKA-465
             Project: Tika
          Issue Type: Improvement
          Components: languageidentifier
            Reporter: Chris A. Mattmann
            Assignee: Chris A. Mattmann
            Priority: Minor


In NUTCH-86, Jerome Charron originally identified a set of improvements
for the LanguageIdentifier that we should consider in Tika:

{quote}
More information can be found in the following thread on the Nutch-Dev mailing list:
http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00569.html

Summary:

1. LanguageIdentifier API changes. The similarity methods should return an ordered array of
language-code/score pairs instead of a simple String containing the language-code.

2. Ensure consistency between LanguageIdentifier scoring and NGramProfile.getSimilarity().
{quote}
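
To illustrate the two points in the summary, here is a minimal Java sketch of what an ordered language-code/score result might look like. Everything in it is an assumption made for illustration: the names RankedLanguageSketch, LanguageScore, similarity() and rank() are hypothetical and are not part of the existing LanguageIdentifier or NGramProfile classes, and cosine similarity over n-gram frequency maps is a stand-in that is not claimed to match what NGramProfile.getSimilarity() computes. The one idea it tries to show is that ranking and pairwise similarity can share a single scoring function, so the two cannot drift apart.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class RankedLanguageSketch {

    /** Hypothetical language-code/score pair; not an existing Tika or Nutch type. */
    public static final class LanguageScore {
        public final String language;
        public final double score;

        public LanguageScore(String language, double score) {
            this.language = language;
            this.score = score;
        }
    }

    /**
     * Single scoring function shared by all callers. Cosine similarity is an
     * assumption chosen for illustration only.
     */
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty profile matches nothing
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Returns every candidate language with its score, best match first. */
    static LanguageScore[] rank(Map<String, Double> text,
                                Map<String, Map<String, Double>> profiles) {
        List<LanguageScore> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, Double>> e : profiles.entrySet()) {
            out.add(new LanguageScore(e.getKey(), similarity(text, e.getValue())));
        }
        out.sort(Comparator.comparingDouble((LanguageScore s) -> s.score).reversed());
        return out.toArray(new LanguageScore[0]);
    }

    public static void main(String[] args) {
        // Toy n-gram frequencies, invented for the example.
        Map<String, Map<String, Double>> profiles = Map.of(
                "en", Map.of("th", 2.0, "he", 1.0),
                "fr", Map.of("le", 2.0, "es", 1.0));
        for (LanguageScore s : rank(Map.of("th", 1.0), profiles)) {
            System.out.println(s.language + " " + s.score);
        }
    }
}
```

Returning the full ordered array, rather than only the top language code, lets a caller apply its own confidence threshold or fall back to the second-best candidate.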

I just wanted to capture the issue here in Tika because I'm about to close it out in Nutch:
LanguageIdentification is something that can happen in Tika-ville...


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

