nutch-dev mailing list archives

From "cao yuzhong" <caoyuzh...@hotmail.com>
Subject A problem about Chinese word segment
Date Tue, 15 Mar 2005 05:16:30 GMT
Hi all,

 Currently, Nutch-0.6 simply treats each Chinese character as a single token.
 I have tried to make it treat a group of related Chinese characters
 (a Chinese word) as one token, so I need to modify the Analyzer.
 
 First, I modified the file NutchAnalysis.jj in src/java/net/nutch/analysis.
 I changed " <SIGRAM: <CJK> > " to " <SIGRAM: (<CJK>)+ > " so that Nutch can
 treat one or more Chinese characters as a token.
 Then I used JavaCC to regenerate the code.
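
 As a rough illustration of what this grammar change does, the sketch below
 compares the two token rules using java.util.regex instead of the
 JavaCC-generated tokenizer. It is only an analogue: the class name
 SigramSketch is made up for this example, and \p{IsHan} is assumed as a
 stand-in for the <CJK> character class, which may cover different ranges.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Nutch.
class SigramSketch {
    // One Han character per token, roughly like <SIGRAM: <CJK> >.
    static final Pattern SINGLE = Pattern.compile("\\p{IsHan}");
    // A run of Han characters per token, roughly like <SIGRAM: (<CJK>)+ >.
    static final Pattern RUN = Pattern.compile("\\p{IsHan}+");

    // Collects every match of the pattern as a token, in order.
    static List<String> tokens(Pattern p, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "中文 搜索引擎";
        System.out.println(tokens(SINGLE, text)); // six single-character tokens
        System.out.println(tokens(RUN, text));    // two tokens, split at the space
    }
}
```

 With the run-based rule, a token ends only where a non-Chinese character
 such as a space appears, which is why the text has to be segmented with
 spaces before it reaches the tokenizer.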
 
 Second, I have to segment Chinese text into Chinese words (inserting a space
 between adjacent words) before indexing so that Nutch can recognize them.
 I have written a class to do this, and I have modified the function refill()
 in FastCharStream.java.
 
 Below the line:
 int charsRead = input.read(buffer, newPosition, buffer.length-newPosition);
 
 I added:
 //----
 if (charsRead != -1) {
     // The text that was just read into the buffer
     String str = new String(buffer, newPosition, charsRead);

     // Do Chinese word segmentation, for example:
     // if str is "中文搜索引擎的分词问题"
     // then str2 will be "中文 搜索引擎 的 分词 问题"
     String str2 = Spliter.segSentence(str);

     // Expand the buffer until the segmented text fits
     while (str2.length() > buffer.length - newPosition) {
         char[] newBuffer = new char[buffer.length * 2];
         System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
         buffer = newBuffer;
     }

     // Overwrite the raw text with the segmented text
     str2.getChars(0, str2.length(), buffer, newPosition);
     charsRead = str2.length();
 }
 //----
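
 The same read, segment, and copy-back step can be tried outside Nutch with
 the self-contained sketch below. Everything in it is an assumption made for
 illustration: the class SegmentingRefillSketch, the tiny dictionary, and the
 forward-maximum-matching segSentence() only stand in for the real Spliter
 class, whose implementation is not shown here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; not the real FastCharStream or Spliter.
class SegmentingRefillSketch {
    // Toy dictionary (assumption); a real segmenter would load a full lexicon.
    static final Set<String> DICT =
            new HashSet<>(Arrays.asList("中文", "搜索引擎", "分词", "问题"));
    static final int MAX_WORD = 4; // longest dictionary entry

    // Forward maximum matching: take the longest dictionary word at each
    // position, otherwise a single character. Assumes all-Chinese input.
    static String segSentence(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int len = 1;
            for (int n = Math.min(MAX_WORD, s.length() - i); n > 1; n--) {
                if (DICT.contains(s.substring(i, i + n))) {
                    len = n;
                    break;
                }
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(s, i, i + len);
            i += len;
        }
        return out.toString();
    }

    char[] buffer = new char[12]; // deliberately small to force a resize
    int newPosition = 0;

    // Reads one chunk, segments it, and writes the result back at newPosition.
    // Returns the number of chars stored, or -1 at end of input.
    int refill(Reader input) {
        try {
            int charsRead = input.read(buffer, newPosition, buffer.length - newPosition);
            if (charsRead == -1) {
                return -1;
            }
            String segmented = segSentence(new String(buffer, newPosition, charsRead));
            int needed = newPosition + segmented.length();
            if (needed > buffer.length) {
                // Grow at least to the needed size, keeping existing content.
                buffer = Arrays.copyOf(buffer, Math.max(needed, buffer.length * 2));
            }
            segmented.getChars(0, segmented.length(), buffer, newPosition);
            return segmented.length();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SegmentingRefillSketch s = new SegmentingRefillSketch();
        int n = s.refill(new StringReader("中文搜索引擎的分词问题"));
        System.out.println(new String(s.buffer, 0, n));
    }
}
```

 One limitation of segmenting each chunk separately is that a word lying
 across the boundary between two reads is split, because each call only sees
 the characters of its own chunk.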
 
 Third, I compiled the code and ran CrawlTool.
 Then I used lukeall-0.5 to inspect the index directory.
 It looks fine: the terms in the index are Chinese words rather than single
 Chinese characters.
 
 But when I deploy Nutch in Tomcat 5.5 and run a test search,
 it can't find anything.
 What's wrong?
 
 Any hints, or pointers to articles about this, would be appreciated.
 
 Best regards.
 
 Cao Yuzhong
 2005-03-15


