tika-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp
Date Mon, 03 Dec 2018 15:13:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707343#comment-16707343 ]

Ken Krugler commented on TIKA-2790:
-----------------------------------

My concern with OpenNLP is that during a web crawl, language detection can add a lot of
processing time even with the current lightweight detection algorithm, and OpenNLP is not
generally known for being "lightweight" :) But we could give it a try, for sure.

Note that OpenNLP uses ISO 639-2 (three-letter codes). Having a more robust representation
of languages in the language detector API would be a good thing in general (e.g. a 639-2 code
plus an optional locale code, so you can differentiate Mandarin Chinese in Taiwan from Mandarin
Chinese in China or Singapore).
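
To make the "code plus optional locale" idea concrete, here is a rough Java sketch of what such a
value type might look like. The class name LanguageTag and all of its methods are hypothetical;
nothing here is part of the existing Tika or OpenNLP APIs, and the validation rules (three
lowercase letters, two uppercase letters for the region) are assumptions made for illustration.

```java
import java.util.Objects;
import java.util.Optional;

/** Hypothetical value type: a three-letter language code plus an optional region. */
final class LanguageTag {
    private final String iso3Code; // three-letter language code, e.g. "zho"
    private final String region;   // two-letter region such as "TW"; null when unknown

    private LanguageTag(String iso3Code, String region) {
        if (iso3Code == null || !iso3Code.matches("[a-z]{3}")) {
            throw new IllegalArgumentException("expected three lowercase letters: " + iso3Code);
        }
        if (region != null && !region.matches("[A-Z]{2}")) {
            throw new IllegalArgumentException("expected two uppercase letters: " + region);
        }
        this.iso3Code = iso3Code;
        this.region = region;
    }

    /** Language only, region unknown. */
    static LanguageTag of(String iso3Code) {
        return new LanguageTag(iso3Code, null);
    }

    /** Language plus region. */
    static LanguageTag of(String iso3Code, String region) {
        return new LanguageTag(iso3Code, Objects.requireNonNull(region, "region"));
    }

    String getIso3Code() {
        return iso3Code;
    }

    Optional<String> getRegion() {
        return Optional.ofNullable(region);
    }

    /** True when the language codes match, ignoring any region. */
    boolean sameLanguage(LanguageTag other) {
        return iso3Code.equals(other.iso3Code);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof LanguageTag)) {
            return false;
        }
        LanguageTag other = (LanguageTag) o;
        return iso3Code.equals(other.iso3Code) && Objects.equals(region, other.region);
    }

    @Override
    public int hashCode() {
        return Objects.hash(iso3Code, region);
    }

    /** Renders as "zho-TW" when a region is present, otherwise just "zho". */
    @Override
    public String toString() {
        return region == null ? iso3Code : iso3Code + "-" + region;
    }
}
```

With something like this, LanguageTag.of("zho", "TW") and LanguageTag.of("zho", "CN") would compare
as the same language via sameLanguage() while remaining distinct under equals(), and a detector
that only knows the language could return LanguageTag.of("zho") with an empty region.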

> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>
>                 Key: TIKA-2790
>                 URL: https://issues.apache.org/jira/browse/TIKA-2790
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
