nutch-dev mailing list archives

From "Sean Dean (JIRA)"
Subject [jira] Commented: (NUTCH-224) Nutch doesn't handle Korean text at all
Date Sat, 02 Dec 2006 01:41:22 GMT
Sean Dean commented on NUTCH-224:

I tested this today using 0.9-dev, and it seems the changes made to Lucene back in 0.7.2
didn't fix the issue. At some point the Nutch code isn't handling Korean the same way
as Chinese and Japanese. I'm also aware that searching in Chinese has an issue, tracked
in ticket NUTCH-36, but it still shows exactly matching results.

Testing details:

I searched for the word "뉴스", which is "news" in English. I have fetched Korean pages
containing this word, so I know for sure it's part of the index. Zero results were displayed.

> Nutch doesn't handle Korean text at all
> ---------------------------------------
>                 Key: NUTCH-224
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 0.7.1
>            Reporter: KuroSaka TeruHiko
> I was browsing NutchAnalysis.jj and found that
> Hangul Syllables (U+AC00 ... U+D7AF; U+xxxx means
> a Unicode character of the hex value xxxx) are not
> part of LETTER or CJK class.  It seems to me that
> Nutch cannot handle Korean documents at all.
> I posted the above message at nutch-user ML and Cheolgoo Kang
> replied as:
> ------------------------------------------------------------------------------------
> There was a similar issue with Lucene's StandardTokenizer.jj.
> and
> I have almost no experience with Nutch, but you can handle it like
> those issues above.
> ------------------------------------------------------------------------------------
> Both fixes should probably be ported back to NutchAnalysis.jj.
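
To illustrate the range named in the issue description, here is a minimal Java sketch of a
Hangul Syllables check. It is not Nutch or Lucene code: the class HangulCheck and its methods
are hypothetical names chosen for this example, and the bounds U+AC00 ... U+D7AF are taken
from the report above. It can be used to confirm that a search term such as "뉴스" consists
entirely of code points in that range, which is the range the report says is missing from
the LETTER and CJK classes.

```java
final class HangulCheck {
    // Bounds of the Hangul Syllables range as quoted in the issue description.
    static final int HANGUL_START = 0xAC00;
    static final int HANGUL_END = 0xD7AF;

    // True if the code point lies inside the quoted Hangul Syllables range.
    static boolean isHangulSyllable(int codePoint) {
        return codePoint >= HANGUL_START && codePoint <= HANGUL_END;
    }

    // True if the text is non-empty and every code point is a Hangul syllable.
    static boolean isAllHangul(String text) {
        return !text.isEmpty()
            && text.codePoints().allMatch(HangulCheck::isHangulSyllable);
    }

    public static void main(String[] args) {
        // The search term from the comment, written with escapes to avoid
        // source-encoding problems.
        String query = "\uB274\uC2A4";
        query.codePoints().forEach(cp ->
            System.out.printf("U+%04X inRange=%b%n", cp, isHangulSyllable(cp)));
    }
}
```

A tokenizer grammar whose character classes omit this range has no rule that would group such
characters into a term, which would be consistent with the zero results reported above.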

This message is automatically generated by JIRA.