nutch-dev mailing list archives

From "cao yuzhong" <>
Subject A problem about Chinese word segment
Date Tue, 15 Mar 2005 05:16:30 GMT

 Now, Nutch-0.6 simply treats each Chinese character as a single token.
 I have attempted to make it treat a group of related Chinese
 characters (a Chinese word) as one token.
 To do that, I needed to modify the Analyzer.
 First, I modified the file NutchAnalysis.jj in src/java/net/nutch/analysis.
 I changed " <SIGRAM: <CJK> > " to " <SIGRAM: (<CJK>)+ > " so that
 one or more consecutive Chinese characters are treated as a single token.
 Then I used JavaCC to regenerate the code.
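 For clarity, the token production change in NutchAnalysis.jj looks like this
 (surrounding grammar omitted; comments are mine):

```
// before: every CJK character becomes its own token
<SIGRAM: <CJK> >

// after: a run of one or more CJK characters forms one token
<SIGRAM: (<CJK>)+ >
```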
 Second, I have to segment Chinese text into Chinese words (insert a space
 between two Chinese words) before indexing so that Nutch can recognize them.
 I have written a class to do this, and I have modified the function refill().
 Below the line:
 int charsRead = input.read(buffer, newPosition, buffer.length-newPosition);
 I added:
 String str1 = new String(buffer, newPosition, charsRead);
 // do Chinese word segmentation, for example:
 // if str1 = "中文搜索引擎的分词问题"
 // then str2 will be "中文 搜索引擎 的 分词 问题"
 String str2 = Spliter.segSentence(str1);
 while (str2.length() > buffer.length - newPosition) {  // expand the buffer
          char[] newBuffer = new char[buffer.length * 2];
          System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
          buffer = newBuffer;
 }
 for (int i = 0; i < str2.length(); i++) {  // copy the segmented text back
          buffer[newPosition + i] = str2.charAt(i);
 }
 charsRead = str2.length();  // the segmented text is what was "read"
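 The message does not include the Spliter class itself. For readers who want
 something runnable, here is a minimal sketch of what Spliter.segSentence
 could look like, using forward maximum matching over a tiny hard-coded
 dictionary. Only the name Spliter.segSentence comes from the message; the
 dictionary, MAX_WORD_LEN, and the whole implementation are illustrative
 assumptions, not the author's code:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the Spliter class used above: a minimal
// forward-maximum-matching segmenter over a tiny hard-coded dictionary.
// A real segmenter would load a large dictionary and handle ambiguity.
public class Spliter {
    private static final Set<String> DICT = new HashSet<String>(Arrays.asList(
            "中文", "搜索引擎", "分词", "问题"));
    private static final int MAX_WORD_LEN = 4;  // longest dictionary entry

    // Insert a space between recognized words; characters not found in the
    // dictionary are emitted one at a time.
    public static String segSentence(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + MAX_WORD_LEN, text.length());
            String word = null;
            // try the longest candidate first
            for (int j = end; j > i; j--) {
                String cand = text.substring(i, j);
                if (DICT.contains(cand)) { word = cand; break; }
            }
            if (word == null) word = text.substring(i, i + 1);  // single-char fallback
            if (out.length() > 0) out.append(' ');
            out.append(word);
            i += word.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(segSentence("中文搜索引擎的分词问题"));
        // 中文 搜索引擎 的 分词 问题
    }
}
```

 Forward maximum matching is the simplest dictionary-based approach; it
 cannot resolve segmentation ambiguities, which is why production systems
 usually use statistical or hybrid segmenters.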
 Third, compiling..., running CrawlTool....
 Then I used lukeall-0.5 to view the index directory.
 It's OK: the terms are now Chinese words rather than single Chinese
 characters.
 But when I deploy Nutch in Tomcat 5.5 and run a search test,
 it can't find anything.
 What's wrong?
 I would appreciate any hints, or recommendations of articles on this topic.
 Best regards.
 Cao Yuzhong
