nutch-dev mailing list archives

From Gavin Thomas Nicol <...@rbii.com>
Subject Re: Detecting CJKV / Asian language pages
Date Mon, 01 Aug 2005 18:54:36 GMT

On Aug 1, 2005, at 12:25 PM, Andy Liu wrote:

> The Nutch language identifier plugin currently doesn't handle
> CJKV pages.  Does anybody here have any experience with automatically
> detecting the language of such pages?
>
> I know there are specific encodings which give away what language the
> page is, but for Asian language pages that use Unicode or its
> variants, I'm out of luck.

For Unicode it's pretty easy: just look for characters that give
away the language, for example Hiragana or Katakana for Japanese,
Hangul for Korean, etc.
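
A minimal sketch of that idea in Java, assuming the page has already been
decoded to a String. The class and method names (CjkScriptGuesser, guess)
are hypothetical and not part of Nutch. It relies only on the JDK's
Character.UnicodeScript. Treating text with Han characters but no kana or
Hangul as Chinese is an assumption of this sketch, and Vietnamese written
in Latin script cannot be recognised this way at all.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper, not part of Nutch: guesses a language from the
// Unicode scripts that occur in already-decoded text.
final class CjkScriptGuesser {

    // Returns "ja", "ko", "zh" or "unknown".
    static String guess(String text) {
        int kana = 0, hangul = 0, han = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp); // step over surrogate pairs correctly
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.HIRAGANA || script == UnicodeScript.KATAKANA) {
                kana++;
            } else if (script == UnicodeScript.HANGUL) {
                hangul++;
            } else if (script == UnicodeScript.HAN) {
                han++;
            }
        }
        // Kana or Hangul give the language away, as described above.
        if (kana > 0 && kana >= hangul) return "ja";
        if (hangul > 0) return "ko";
        // Assumption: Han characters with no kana or Hangul are taken as Chinese.
        if (han > 0) return "zh";
        return "unknown";
    }

    public static void main(String[] args) {
        // "kore wa nihongo" written with hiragana and kanji
        System.out.println(guess("\u3053\u308C\u306F\u65E5\u672C\u8A9E"));
        // "hangugeo" written in Hangul
        System.out.println(guess("\uD55C\uAD6D\uC5B4"));
    }
}
```

Counting characters rather than stopping at the first hit keeps a page with
a single stray foreign word from being misclassified. A real plugin would
probably want a minimum ratio rather than the simple comparison used here.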

It's hard to detect all the various legacy encodings (EUC-JP, Shift_JIS,
ISO-2022-KR/JP, Big5, etc.), and many servers do not correctly identify
the encodings.
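
One rough way to narrow the candidates, sketched here as an assumption
rather than anything Nutch does, is to try a strict decode with each
charset and discard the ones that report malformed input. The names
StrictDecodeProbe and plausibleCharsets are hypothetical. Which CJK
charsets are available depends on the JDK installation, so unsupported
names are skipped.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: keeps only the charsets in which
// the raw bytes decode without any malformed or unmappable sequence.
final class StrictDecodeProbe {

    static List<String> plausibleCharsets(byte[] bytes, String... names) {
        List<String> plausible = new ArrayList<>();
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                continue; // this JDK does not ship the charset
            }
            try {
                Charset.forName(name).newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(bytes));
                plausible.add(name);
            } catch (CharacterCodingException e) {
                // The bytes are not valid in this charset, so rule it out.
            }
        }
        return plausible;
    }

    public static void main(String[] args) {
        byte[] page = "\u65E5\u672C\u8A9E".getBytes(Charset.forName("UTF-8"));
        System.out.println(plausibleCharsets(page,
                "UTF-8", "EUC-JP", "Shift_JIS", "ISO-2022-JP", "Big5"));
    }
}
```

This only rules encodings out. Several charsets often accept the same
bytes, which is part of why the problem is hard, so the survivors still
need a statistical or script-based check to pick between them.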

