nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1397) language-identifier incorrectly handles double-barreled language properties
Date Fri, 15 Jun 2012 16:02:42 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295738#comment-13295738
] 

Julien Nioche commented on NUTCH-1397:
--------------------------------------

Lewis, language identification combines parsing of the HTML (done in Nutch) with
statistical guessing (from Tika). The parser component ignores compound values and returns
only the main language code. The statistical component returns only the 2-letter code
(and given how bad it is at it, I don't think it would be wise to try and get it to be
more specific). In a nutshell, these compound language codes are not supported in Nutch. We
could possibly store a separate value holding the secondary code when the parsing provides
one, but the identifier cannot supply it.
Makes sense?
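
To make the "separate value" idea concrete, here is a rough sketch of splitting a compound
value such as en-GB into a main code and an optional secondary code. This is only an
illustration: the class CompoundLanguageCode and its parse method are hypothetical names,
not existing Nutch or Tika code, and accepting "_" as well as "-" as a separator is an
assumption.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper (not part of Nutch or Tika): splits a compound language value. */
final class CompoundLanguageCode {
    final String primary;             // main language code, e.g. "en"
    final Optional<String> secondary; // secondary code, e.g. "GB"; empty when absent

    private CompoundLanguageCode(String primary, Optional<String> secondary) {
        this.primary = primary;
        this.secondary = secondary;
    }

    /** Parses values such as "en", "en-GB" or "en_us" (separator choice is an assumption). */
    static CompoundLanguageCode parse(String raw) {
        // Split on the first '-' or '_' only; anything after it is the secondary part.
        String[] parts = raw.trim().split("[-_]", 2);
        String primary = parts[0].toLowerCase(Locale.ROOT);
        Optional<String> secondary = (parts.length > 1 && !parts[1].isEmpty())
                ? Optional.of(parts[1].toUpperCase(Locale.ROOT))
                : Optional.empty();
        return new CompoundLanguageCode(primary, secondary);
    }

    public static void main(String[] args) {
        CompoundLanguageCode code = CompoundLanguageCode.parse("en-GB");
        System.out.println(code.primary + " / " + code.secondary.orElse("(none)"));
    }
}
```

With something like this, the main code would stay what the parser returns today, and the
secondary part would only be filled in when the HTML actually declares it.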
                
> language-identifier incorrectly handles double-barreled language properties
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-1397
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1397
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: nutchgora, 1.5
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.6, 2.1
>
>
> Currently, when language-identifier is activated it parses and identifies language-type=en,
> but does not identify en-GB or en-US. This issue should correct that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
