nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2449) Usage of Tika LanguageIdentifier in language-identifier plugin
Date Wed, 06 Jun 2018 07:41:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16502919#comment-16502919 ]

ASF GitHub Bot commented on NUTCH-2449:
---------------------------------------

sebastian-nagel commented on issue #233: NUTCH-2449: Replace Tika LanguageIdentifier in language-identifier
URL: https://github.com/apache/nutch/pull/233#issuecomment-394972277
 
 
   Ok, better to update the tests. Thanks for the hint, @YossiTamari!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Usage of Tika LanguageIdentifier in language-identifier plugin
> --------------------------------------------------------------
>
>                 Key: NUTCH-2449
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2449
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>            Priority: Major
>
> The language-identifier plugin uses org.apache.tika.language.LanguageIdentifier to extract
> the language from the document text. There are two issues with that:
> # LanguageIdentifier is deprecated in Tika.
> # It does not support CJK languages (and, I suspect, many other languages; see https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
> and it does not even fail gracefully on them: in my experience, Chinese was recognized
> as Italian.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
