nutch-dev mailing list archives

From "Fengtan (JIRA)" <j...@apache.org>
Subject [jira] [Created] (NUTCH-2278) Handle alpha-2 language codes consistently
Date Sat, 11 Jun 2016 02:56:20 GMT
Fengtan created NUTCH-2278:
------------------------------

             Summary: Handle alpha-2 language codes consistently
                 Key: NUTCH-2278
                 URL: https://issues.apache.org/jira/browse/NUTCH-2278
             Project: Nutch
          Issue Type: Improvement
          Components: plugin
    Affects Versions: 1.12
            Reporter: Fengtan
            Priority: Minor


The language-identifier plugin provides two extraction policies: detect and identify.

However, the two policies handle [alpha-2|https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2]
country codes differently:
* 'identify' strips out the alpha-2 country code, e.g. if the identified language is 'en-US'
then it injects 'en' in the meta tags
* 'detect' does not strip out the alpha-2 country code, e.g. if the detected language is
'en-US' then it injects 'en-US' in the meta tags

Any chance we can make this consistent and always strip out the alpha-2 code?
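
To illustrate the requested behaviour, here is a minimal sketch of the normalization both policies could share. The class and method names (LanguageCodeNormalizer, stripRegion) are hypothetical and are not part of the Nutch code base; accepting '_' as a separator and lower-casing the result are also assumptions made for the example.

```java
import java.util.Locale;

// Hypothetical helper, not taken from the language-identifier plugin.
final class LanguageCodeNormalizer {

    private LanguageCodeNormalizer() {}

    // Returns the primary language subtag, e.g. "en-US" becomes "en".
    // Codes without a region part are returned unchanged (lower-cased).
    static String stripRegion(String code) {
        if (code == null) {
            return null;
        }
        String trimmed = code.trim();
        int sep = indexOfSeparator(trimmed);
        String primary = sep >= 0 ? trimmed.substring(0, sep) : trimmed;
        return primary.toLowerCase(Locale.ROOT);
    }

    // Finds the first '-' or '_' (assumption: both separators may occur).
    private static int indexOfSeparator(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '-' || c == '_') {
                return i;
            }
        }
        return -1;
    }
}
```

If both 'detect' and 'identify' passed their result through something like stripRegion before injecting it in the meta tags, 'en-US' and 'en' would both end up as 'en'.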



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
