nutch-dev mailing list archives

From "cao yuzhong" <>
Subject RE: A problem about Chinese word segment
Date Thu, 17 Mar 2005 02:27:40 GMT
No answer for this?
Any tips are appreciated.

>From: "cao yuzhong" <>
>Subject: A problem about Chinese word segment
>Date: Tue, 15 Mar 2005 05:16:30 +0000
>Currently, Nutch-0.6 simply treats each Chinese character as a single token.
>I have attempted to make it treat a group of related Chinese
>characters (called a Chinese word) as a token.
>So I need to modify the Analyzer.
>First, I modified the file NutchAnalysis.jj in 
>I changed " <SIGRAM: <CJK> > " to " <SIGRAM: (<CJK>)+ > " so that
>Nutch can treat one or more Chinese characters as a token. Then I used JavaCC
>to generate the code.
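
The effect of that grammar change can be illustrated outside Nutch. The following is only a minimal sketch using `java.util.regex`, not the JavaCC-generated tokenizer; the class name `CjkTokenDemo` and the use of `\p{IsHan}` as a stand-in for `<CJK>` are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CjkTokenDemo {
    // Stand-in for <SIGRAM: <CJK> >: one Han character per token.
    static final Pattern SINGLE = Pattern.compile("\\p{IsHan}");
    // Stand-in for <SIGRAM: (<CJK>)+ >: a maximal run of Han characters per token.
    static final Pattern RUN = Pattern.compile("\\p{IsHan}+");

    // Collect every match of the pattern in the text, in order.
    static List<String> tokens(Pattern p, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "中文 搜索引擎";
        System.out.println(tokens(SINGLE, text)); // six one-character tokens
        System.out.println(tokens(RUN, text));    // two tokens, split at the space
    }
}
```

With the run-based rule, token boundaries come only from non-CJK characters such as spaces, which is why the text has to be segmented before it reaches the tokenizer.
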
>Second, I have to segment Chinese text into Chinese words (inserting a
>space between two Chinese words) before indexing so that Nutch can
>recognize them. I have written a class
>to do this, and I have modified the function refill() in 
>below the line:
>int charsRead, newPosition, 
>I added:
>String str1 = new String(buffer, newPosition, charsRead);
>// do Chinese word segmentation, for example:
>// if str1 = "中文搜索引擎的分词问题"
>// then str2 will be "中文 搜索引擎 的 分词 问题"
>String str2 = Spliter.segSentence(str1);
>while (str2.length() > buffer.length - newPosition) {  // expand the buffer
>    char[] newBuffer = new char[buffer.length * 2];
>    System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
>    buffer = newBuffer;
>}
>// copy the segmented text back over the original region
>for (int i = 0; i < str2.length(); i++) {
>    buffer[newPosition + i] = str2.charAt(i);
>}
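
The buffer step quoted above can be written as a standalone helper. This is a sketch under assumptions, not the actual refill() code: `SegmentBufferDemo`, `Result`, and `segmentRegion` are hypothetical names, and the segmenter is passed in as a function in place of `Spliter.segSentence`. The point it illustrates is that the segmented text is longer than the text that was read, so the caller has to work with the new length rather than the original `charsRead`; whether the modified refill() does so is not shown in the message.

```java
import java.util.Arrays;
import java.util.function.UnaryOperator;

public class SegmentBufferDemo {
    /** The (possibly reallocated) buffer and the number of chars now valid at newPosition. */
    static final class Result {
        final char[] buffer;
        final int charsWritten;

        Result(char[] buffer, int charsWritten) {
            this.buffer = buffer;
            this.charsWritten = charsWritten;
        }
    }

    static Result segmentRegion(char[] buffer, int newPosition, int charsRead,
                                UnaryOperator<String> segmenter) {
        String original = new String(buffer, newPosition, charsRead);
        String segmented = segmenter.apply(original);
        int needed = newPosition + segmented.length();
        if (needed > buffer.length) {
            // Grow to at least the needed size; at least doubling keeps regrowth rare.
            buffer = Arrays.copyOf(buffer, Math.max(needed, buffer.length * 2));
        }
        // Overwrite the region that was read with the segmented text.
        segmented.getChars(0, segmented.length(), buffer, newPosition);
        return new Result(buffer, segmented.length());
    }

    public static void main(String[] args) {
        char[] buf = "ab中文搜索".toCharArray();
        // Hypothetical segmenter: inserts a space after the first two characters.
        Result r = segmentRegion(buf, 2, 4, s -> s.substring(0, 2) + " " + s.substring(2));
        System.out.println(new String(r.buffer, 0, 2 + r.charsWritten));
        System.out.println(r.charsWritten);
    }
}
```
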
>Third, compiling..., running CrawlTool....
>Then I used lukeall-0.5 to view the index directory.
>It's OK: Chinese words, not single Chinese characters, have been
>organized as terms.
>But when I deploy Nutch in Tomcat 5.5 and run a search test,
>it can't find anything. What's wrong?
>I need your hints, or perhaps you could recommend some articles about this.
>Best regards.
>Cao Yuzhong
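
One thing that may be worth checking, offered as an assumption and not as a diagnosis of the problem above: whether the query string goes through the same segmentation as the indexed text. The toy sketch below (hypothetical class `QueryMismatchDemo`, a regex tokenizer standing in for the modified analyzer, a `Set` standing in for the index) shows what would happen if it did not: with a run-based token rule, an unsegmented query becomes one long token that matches no indexed term.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QueryMismatchDemo {
    // Toy stand-in for a (<CJK>)+ token rule: maximal runs of Han characters.
    static final Pattern RUN = Pattern.compile("\\p{IsHan}+");

    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = RUN.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // True if every token of the query is present among the indexed terms.
    static boolean allTermsIndexed(Set<String> indexTerms, String query) {
        return indexTerms.containsAll(tokens(query));
    }

    public static void main(String[] args) {
        // Terms produced from text that was segmented before tokenizing.
        Set<String> indexTerms = new HashSet<>(tokens("中文 搜索引擎 的 分词 问题"));
        System.out.println(allTermsIndexed(indexTerms, "搜索引擎分词"));  // false: one unsegmented token
        System.out.println(allTermsIndexed(indexTerms, "搜索引擎 分词")); // true: both tokens are indexed terms
    }
}
```
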
