tika-dev mailing list archives

From "James Sullivan (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-856) Support CJK (Chinese, Japanese and Korean) language detection
Date Mon, 20 Feb 2012 04:33:34 GMT

    [ https://issues.apache.org/jira/browse/TIKA-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211664#comment-13211664 ]

James Sullivan commented on TIKA-856:
-------------------------------------

I gave it a shot this weekend using Jan H.'s instructions with a Japanese corpus I put together,
which coincidentally used the same Wikipedia entries but not the tool Jan R. mentions. I could
not get good results, even after experimenting a little with what was included in the corpus. I
may well have gotten something basic wrong, but I suspect there is more to this than just generating
a profile. Initially I thought the results would be perfect, given the lack of overlap between
the Latin and Japanese character sets. But the .ngp file has only 1,000 lines, and there are
2,136 characters in the Joyo Kanji alone, so I suspect some modifications will need to be made
to the current implementation for this to work.
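
To make the size mismatch concrete, here is a rough sketch (not Tika's actual code) of what happens when a profile keeps only a fixed number of the most frequent n-grams. The function names, the 1,000-entry cap taken from the .ngp line count, and the synthetic "kanji" text are all assumptions for illustration; a real test would use the actual corpus and Tika's own profile builder.

```python
from collections import Counter

# Assumed values for illustration only: the cap mirrors the 1,000 lines
# seen in the .ngp file, and 2,136 is the size of the Joyo Kanji list.
PROFILE_LIMIT = 1000
JOYO_KANJI_COUNT = 2136


def ngram_counts(text, n):
    """Count overlapping character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def truncated_coverage(text, n, limit):
    """Fraction of n-gram occurrences that survive when only the
    `limit` most frequent n-grams are kept in the profile."""
    counts = ngram_counts(text, n)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    kept = sum(count for _, count in counts.most_common(limit))
    return kept / total


# Synthetic stand-in for a kanji-heavy text: 2,136 distinct CJK code points
# starting at U+4E00. These are NOT the real Joyo Kanji, just placeholders
# with the same number of distinct characters.
synthetic_kanji = "".join(chr(0x4E00 + i) for i in range(JOYO_KANJI_COUNT))

# Small Latin-script sample for comparison.
latin_sample = "the quick brown fox jumps over the lazy dog " * 50

if __name__ == "__main__":
    print("kanji unigram coverage:",
          truncated_coverage(synthetic_kanji, 1, PROFILE_LIMIT))
    print("latin unigram coverage:",
          truncated_coverage(latin_sample, 1, PROFILE_LIMIT))
```

Even at the single-character level, a 1,000-entry cap cannot hold every character in this synthetic text, while the Latin sample fits with room to spare. With longer n-grams the number of distinct entries only grows, which is consistent with the poor results above, though it does not prove that truncation is the cause.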
                
> Support CJK (Chinese, Japanese and Korean) language detection
> -------------------------------------------------------------
>
>                 Key: TIKA-856
>                 URL: https://issues.apache.org/jira/browse/TIKA-856
>             Project: Tika
>          Issue Type: New Feature
>          Components: languageidentifier
>    Affects Versions: 1.0
>         Environment: All
>            Reporter: James Sullivan
>              Labels: Chinese, Japanese
>
> Support language detection of CJK (Chinese, Japanese and Korean).
> Some estimates have Chinese users overtaking English users on the Internet, so it is
> important that these languages, used by a large number of people, be supported.
> See TIKA-855

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira