nutch-dev mailing list archives

From "cao yuzhong" <caoyuzh...@hotmail.com>
Subject RE: A problem about Chinese word segment
Date Thu, 17 Mar 2005 02:27:40 GMT
No answer for this?
Any tips are appreciated.

>From: "cao yuzhong" <caoyuzhong@hotmail.com>
>Reply-To: nutch-dev@incubator.apache.org
>To: nutch-dev@incubator.apache.org
>CC: caoyuzhong@hotmail.com
>Subject: A problem about Chinese word segment
>Date: Tue, 15 Mar 2005 05:16:30 +0000
>
>Hi all,
>
>Nutch-0.6 currently treats each Chinese character as a separate token.
>I have tried to make it treat a sequence of related Chinese
>characters (a Chinese word) as a single token,
>so I need to modify the Analyzer.
>
>First, I modified the file NutchAnalysis.jj in
>src/java/net/nutch/analysis.
>I changed " <SIGRAM: <CJK> > " to " <SIGRAM: (<CJK>)+ > " so that
>Nutch can
>treat one or more Chinese characters as a token. Then I used JavaCC
>to regenerate the code.
>
>Second, I have to segment Chinese text into words (inserting a
>space between adjacent Chinese words) before indexing so that Nutch can
>recognize them. I have written a class
>to do this, and I have modified the function refill() in
>FastCharStream.java.
>
>Below the line:
>int charsRead = input.read(buffer, newPosition,
>buffer.length-newPosition);
>
>I added:
>//----
>if (charsRead != -1) {
>
>    String str = new String(buffer, newPosition, charsRead);
>
>    // Do Chinese word segmentation. For example,
>    // if str = "中文搜索引擎的分词问题"
>    // then str2 will be "中文 搜索引擎 的 分词 问题"
>    String str2 = Spliter.segSentence(str);
>
>    // Expand the buffer until the segmented text fits
>    while (str2.length() > buffer.length - newPosition) {
>        char[] newBuffer = new char[buffer.length * 2];
>        System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
>        buffer = newBuffer;
>    }
>
>    // Overwrite the raw characters with the segmented text
>    str2.getChars(0, str2.length(), buffer, newPosition);
>    charsRead = str2.length();
>}
>//----
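
For readers who want to try this, the segmentation step quoted above (turning "中文搜索引擎的分词问题" into "中文 搜索引擎 的 分词 问题") can be sketched with forward maximum matching against a dictionary. This is only a minimal illustration and not the Spliter class mentioned in the message: the class name MaxMatchSegmenter, the tiny dictionary, and the maximum word length of 4 are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a segmenter class; not part of Nutch.
final class MaxMatchSegmenter {
    // Toy dictionary, assumed for illustration; a real one would be loaded from a file.
    private static final Set<String> DICT = new HashSet<>(Arrays.asList(
            "中文", "搜索引擎", "搜索", "引擎", "分词", "问题"));

    // Longest dictionary entry we try to match (an assumption of this sketch).
    private static final int MAX_WORD_LEN = 4;

    static boolean isCjk(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    // Returns the text with a space inserted before each CJK word where needed.
    static String segment(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            if (!isCjk(text.charAt(i))) {
                // Non-CJK characters are copied through unchanged.
                out.append(text.charAt(i++));
                continue;
            }
            // Try the longest candidate first and shrink until the dictionary
            // matches; an unmatched character becomes a one-character word.
            int len = Math.min(MAX_WORD_LEN, text.length() - i);
            while (len > 1 && !DICT.contains(text.substring(i, i + len))) {
                len--;
            }
            // Separate this word from whatever precedes it, unless the output
            // is empty or already ends in a space.
            if (out.length() > 0 && out.charAt(out.length() - 1) != ' ') {
                out.append(' ');
            }
            out.append(text, i, i + len);
            i += len;
        }
        return out.toString();
    }
}
```

With this toy dictionary, MaxMatchSegmenter.segment("中文搜索引擎的分词问题") yields "中文 搜索引擎 的 分词 问题". Forward maximum matching is greedy, so it can pick the wrong split on ambiguous text; it is shown here only because it is short.
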
>
>Third, I compiled everything and ran CrawlTool.
>Then I used lukeall-0.5 to view the index directory.
>It looks OK: Chinese words, not single Chinese characters, have been
>indexed as terms.
>
>But when I deploy Nutch in Tomcat 5.5 and run a test search,
>it can't find anything. What's wrong?
>
>I would appreciate any hints, or recommendations of articles about this.
>
>Best regards.
>
>Cao Yuzhong
>2005-03-15
>
>
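
One thing to note about the refill() change quoted above: refill() sees the input one buffer-sized chunk at a time, so a word that straddles two chunks is split before the segmenter ever sees it. A possible alternative, sketched below, is to segment the whole text before it reaches the tokenizer by wrapping the input Reader. This is a sketch under assumptions, not Nutch code: SegmentingReader and readFully are hypothetical names, the segmenter is passed in as a plain function, and it uses java.util.function, which is newer than the Java version Nutch-0.6 targets.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.UnaryOperator;

// Hypothetical wrapper; the name and constructor are assumptions of this sketch.
final class SegmentingReader extends Reader {
    private final Reader segmented;

    // Reads all of 'in', applies the segmenter once to the complete text,
    // and then serves the segmented text to the caller.
    SegmentingReader(Reader in, UnaryOperator<String> segmenter) {
        this.segmented = new StringReader(segmenter.apply(readFully(in)));
    }

    // Drains a Reader into a String and closes it; I/O errors are rethrown unchecked.
    static String readFully(Reader in) {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[4096];
        try {
            int n;
            while ((n = in.read(chunk, 0, chunk.length)) != -1) {
                sb.append(chunk, 0, n);
            }
            in.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        return segmented.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        segmented.close();
    }
}
```

The trade-off is that the whole text is held in memory, which is fine for a query string or a typical page but not for very large inputs. The tokenizer's buffer handling is left untouched.
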


