nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2449) Usage of Tika LanguageIdentifier in language-identifier plugin
Date Tue, 05 Jun 2018 15:46:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501969#comment-16501969 ]

ASF GitHub Bot commented on NUTCH-2449:
---------------------------------------

sebastian-nagel commented on issue #233: NUTCH-2449: Replace Tika LanguageIdentifier in language-identifier
URL: https://github.com/apache/nutch/pull/233#issuecomment-394760001
 
 
   Hi @YossiTamari, I've finally found the time to test the PR. Fetching your branch failed;
to resolve the conflicts, I've created a [new branch](https://github.com/sebastian-nagel/nutch/tree/YossiTamari-NUTCH-2449)
and applied your patch. One trivial fix: we still need to copy `langmapping.properties` (used
to parse the HTML lang attribute) to runtime. Everything works fine! If there are no objections,
I'll merge soon. Thanks!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Usage of Tika LanguageIdentifier in language-identifier plugin
> --------------------------------------------------------------
>
>                 Key: NUTCH-2449
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2449
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>            Priority: Major
>
> The language-identifier plugin uses org.apache.tika.language.LanguageIdentifier for extracting
> the language from the document text. There are two issues with that:
> # LanguageIdentifier is deprecated in Tika.
> # It does not support CJK languages (and, I suspect, a lot of other languages; see https://wiki.apache.org/nutch/LanguageIdentifierPlugin#Implemented_Languages_and_their_ISO_636_Codes),
> and it doesn't even fail gracefully with them: in my experience, Chinese was recognized
> as Italian.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
