tika-dev mailing list archives

From "Ted Dunning (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-209) Language detection is weak.
Date Wed, 15 Jul 2009 18:50:15 GMT

    [ https://issues.apache.org/jira/browse/TIKA-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731621#action_12731621 ]

Ted Dunning commented on TIKA-209:
----------------------------------


I haven't looked at the Nutch code in a long time, but as I recall it didn't use the best
statistics for the task.  Here is an approach that seems to be more accurate:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.1958

Sadly, I don't have a Java implementation of this handy.  I can give out an ancient C implementation.





> Language detection is weak.
> ---------------------------
>
>                 Key: TIKA-209
>                 URL: https://issues.apache.org/jira/browse/TIKA-209
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.3
>            Reporter: Robert Newson
>
> In org.apache.tika.utils.Utils, the getUTF8Reader method assigns a language determination
> without checking the confidence rating from ICU's CharsetDetector.
> Please add a configurable level (0-100):
> if (language != null && match.getConfidence() > THRESHOLD) {
>   metadata.set(Metadata.CONTENT_LANGUAGE, match.getLanguage());
>   metadata.set(Metadata.LANGUAGE, match.getLanguage());
> }
> Obviously using charset to imply language is generally weak, but it would be sufficient
> if the confidence threshold were controlled. Today, the text "hello" is tagged as French, for
> example.
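
The threshold check requested above could be sketched roughly as follows. This is
only an illustration, not Tika's code: Match is a hypothetical stand-in for ICU's
CharsetMatch (just the getLanguage() and getConfidence() getters quoted in the
report), a plain Map stands in for Tika's Metadata, and the two key strings are
assumed placeholders for Metadata.CONTENT_LANGUAGE and Metadata.LANGUAGE.

```java
import java.util.Map;

final class LanguageGate {

    /** Hypothetical stand-in for ICU's CharsetMatch; only the two getters used in the report. */
    static final class Match {
        private final String language;
        private final int confidence; // 0-100, as in the report

        Match(String language, int confidence) {
            this.language = language;
            this.confidence = confidence;
        }

        String getLanguage() { return language; }
        int getConfidence() { return confidence; }
    }

    // Assumed placeholder keys standing in for Metadata.CONTENT_LANGUAGE and Metadata.LANGUAGE.
    static final String CONTENT_LANGUAGE = "Content-Language";
    static final String LANGUAGE = "language";

    /**
     * Records the detected language only when the detector's confidence is
     * strictly above the configured threshold. Returns true if it was recorded.
     */
    static boolean applyLanguage(Match match, int threshold, Map<String, String> metadata) {
        if (threshold < 0 || threshold > 100) {
            throw new IllegalArgumentException("threshold must be in 0-100: " + threshold);
        }
        String language = match.getLanguage();
        if (language != null && match.getConfidence() > threshold) {
            metadata.put(CONTENT_LANGUAGE, language);
            metadata.put(LANGUAGE, language);
            return true;
        }
        // Low-confidence or missing guesses leave the metadata untouched.
        return false;
    }
}
```

With a threshold of, say, 50, a low-confidence guess would leave the metadata
untouched instead of being tagged with a language. The confidence values a real
detector reports for short inputs are not shown here.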

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

