nutch-dev mailing list archives

From "cao yuzhong" <caoyuzh...@hotmail.com>
Subject Re: [Nutch-dev] RE: A problem about Chinese word segment
Date Thu, 17 Mar 2005 05:49:00 GMT
I have already added Chinese stopwords to the String[] STOP_WORDS array in NutchAnalysis.jj.
My problem is that Nutch returns nothing when I search with any Chinese keyword,
even though I can find these keywords in the index files (using Luke).
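
For anyone who hits the same symptom: one possible cause (an assumption, not something confirmed against the Nutch source) is that the query string is not tokenized the same way as the indexed text. If the index holds word-level terms such as "分词" but the query is still split into single characters, no term lookup will match. The sketch below uses a toy in-memory index rather than Nutch or Lucene classes; TokenMismatchDemo, segmentWords, splitChars, buildIndex and search are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TokenMismatchDemo {
    // Stand-in for a word segmenter: assumes the words are already separated by spaces.
    static List<String> segmentWords(String segmented) {
        return Arrays.asList(segmented.trim().split("\\s+"));
    }

    // Per-character tokenization: every non-whitespace character becomes its own token.
    static List<String> splitChars(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!Character.isWhitespace(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        return tokens;
    }

    // Builds term -> document ids, using word-level terms.
    static Map<String, Set<Integer>> buildIndex(List<String> docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : segmentWords(docs.get(id))) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    // AND search: a document matches only if every query token is one of its terms.
    static Set<Integer> search(Map<String, Set<Integer>> index, List<String> queryTokens) {
        Set<Integer> result = null;
        for (String token : queryTokens) {
            Set<Integer> postings = index.getOrDefault(token, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(postings);
            } else {
                result.retainAll(postings);
            }
        }
        if (result == null) {
            return new TreeSet<>();
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index =
            buildIndex(Arrays.asList("中文 搜索引擎 的 分词 问题"));
        // Query tokenized per character: "分" and "词" are not index terms.
        System.out.println(search(index, splitChars("分词")));
        // Query tokenized as one word: matches the indexed term "分词".
        System.out.println(search(index, segmentWords("分词")));
    }
}
```

If this is what is happening, the first search finds nothing while the second finds document 0, which would match a working index in Luke combined with an empty result page.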


>From: "Jason Tang" <jason.tang@commcentral.com>
>Reply-To: nutch-dev@incubator.apache.org
>To: "dev@nutch.org" <dev@nutch.org>
>Subject: Re: [Nutch-dev] RE: A problem about Chinese word segment
>Date: Thu, 17 Mar 2005 11:08:15 +0800
>
>Hi cao
>
>I think the character "的" is a stopword in Chinese.
>I think NutchAnalysis.jj should load a different stopword file for each
>language.
>
>/Jack
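
A minimal sketch of the per-language stopword idea, assuming a language code is available when the text is analyzed. It does not use any Nutch API; StopWordsByLanguage, forLanguage and removeStopWords are hypothetical names, and the word lists are tiny illustrative samples rather than real stopword files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class StopWordsByLanguage {
    // Illustrative samples only; a real setup would load one file per language.
    private static final Map<String, Set<String>> STOP_WORDS = new HashMap<>();
    static {
        STOP_WORDS.put("en", new HashSet<>(Arrays.asList("a", "the", "of")));
        STOP_WORDS.put("zh", new HashSet<>(Arrays.asList("的", "了", "是")));
    }

    // Returns the stopword set for a language code, or an empty set if none is registered.
    static Set<String> forLanguage(String lang) {
        return STOP_WORDS.getOrDefault(lang, Collections.<String>emptySet());
    }

    // Drops every token that is a stopword in the given language.
    static List<String> removeStopWords(List<String> tokens, String lang) {
        Set<String> stop = forLanguage(lang);
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stop.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("中文", "搜索引擎", "的", "分词", "问题");
        System.out.println(removeStopWords(tokens, "zh"));
        System.out.println(removeStopWords(tokens, "en"));
    }
}
```

With the "zh" set the token "的" is dropped; with the "en" set the same tokens pass through unchanged.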
>
>
>
>======= At 2005-03-17, 10:27:40 you wrote: =======
>
> >No answer to this?
> >Any tips are appreciated.
> >
> >>From: "cao yuzhong" <caoyuzhong@hotmail.com>
> >>Reply-To: nutch-dev@incubator.apache.org
> >>To: nutch-dev@incubator.apache.org
> >>CC: caoyuzhong@hotmail.com
> >>Subject: A problem about Chinese word segment
> >>Date: Tue, 15 Mar 2005 05:16:30 +0000
> >>
> >>Hi all,
> >>
> >>Currently, Nutch-0.6 simply treats each Chinese character as a single token.
> >>I have tried to make it treat a run of related Chinese
> >>characters (a Chinese word) as one token,
> >>so I needed to modify the analyzer.
> >>
> >>First, I modified the file NutchAnalysis.jj in
> >>src/java/net/nutch/analysis.
> >>I changed " <SIGRAM: <CJK> > " to " <SIGRAM: (<CJK>)+ > " so that
> >>Nutch can treat one or more Chinese characters as a token. Then I used
> >>JavaCC to regenerate the code.
> >>
> >>Second, I have to segment Chinese text into Chinese words (inserting a
> >>space between adjacent words) before indexing so that Nutch can
> >>recognize them. I have written a class
> >>to do this, and I have modified the function refill() in
> >>FastCharStream.java:
> >>
> >>Below the line:
> >>int charsRead = input.read(buffer, newPosition,
> >>                           buffer.length - newPosition);
> >>
> >>I added:
> >>//----
> >>if (charsRead != -1) {
> >>  // The text that was just read into the buffer
> >>  String str = new String(buffer, newPosition, charsRead);
> >>
> >>  // Do Chinese word segmentation, for example:
> >>  // if str is "中文搜索引擎的分词问题"
> >>  // then str2 will be "中文 搜索引擎 的 分词 问题"
> >>  String str2 = Spliter.segSentence(str);
> >>
> >>  // Expand the buffer until the segmented text fits
> >>  while (str2.length() > buffer.length - newPosition) {
> >>    char[] newBuffer = new char[buffer.length * 2];
> >>    System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
> >>    buffer = newBuffer;
> >>  }
> >>
> >>  // Overwrite the raw text with the segmented text
> >>  for (int i = 0; i < str2.length(); i++) {
> >>    buffer[newPosition + i] = str2.charAt(i);
> >>  }
> >>  charsRead = str2.length();
> >>}
> >>//----
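
A self-contained sketch of the same buffer-rewriting idea, for experimenting outside Nutch. It is not the FastCharStream code: SegmentingRefill, Result and rewrite are hypothetical names, and the segmenter is passed in as a plain function so that any implementation can be plugged in. One caveat that this sketch does not handle: because each call sees only one chunk of input, a word that straddles two reads may be segmented incorrectly.

```java
import java.util.Arrays;
import java.util.function.UnaryOperator;

public class SegmentingRefill {
    /** The (possibly reallocated) buffer plus the number of valid chars after newPosition. */
    static final class Result {
        final char[] buffer;
        final int charsRead;

        Result(char[] buffer, int charsRead) {
            this.buffer = buffer;
            this.charsRead = charsRead;
        }
    }

    /**
     * Replaces buffer[newPosition .. newPosition + charsRead) with its segmented form,
     * growing the buffer when the segmented text is longer than the free space.
     */
    static Result rewrite(char[] buffer, int newPosition, int charsRead,
                          UnaryOperator<String> segmenter) {
        if (charsRead == -1) {
            return new Result(buffer, -1); // end of input: nothing to rewrite
        }
        String segmented = segmenter.apply(new String(buffer, newPosition, charsRead));
        int needed = newPosition + segmented.length();
        if (needed > buffer.length) {
            // Keeps the chars before newPosition and leaves room for the new text.
            buffer = Arrays.copyOf(buffer, Math.max(needed, buffer.length * 2));
        }
        segmented.getChars(0, segmented.length(), buffer, newPosition);
        return new Result(buffer, segmented.length());
    }

    public static void main(String[] args) {
        // Stand-in segmenter for the demo: inserts a space after every character.
        UnaryOperator<String> spaceOut = s -> s.replaceAll("(.)", "$1 ");
        char[] buf = new char[4];
        "ab".getChars(0, 2, buf, 1); // pretend one char is already buffered before position 1
        Result r = rewrite(buf, 1, 2, spaceOut);
        System.out.println(r.buffer.length + " " + r.charsRead
            + " [" + new String(r.buffer, 1, r.charsRead) + "]");
    }
}
```

Returning the buffer together with the new count keeps the method free of shared state; inside a stream class the same logic would instead update the buffer field and charsRead in place.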
> >>
> >>Third, I compiled everything and ran CrawlTool.
> >>Then I used lukeall-0.5 to view the index directory.
> >>It looks fine: Chinese words, not single Chinese characters, have been
> >>indexed as terms.
> >>
> >>But when I deploy Nutch in Tomcat 5.5 and run a search test,
> >>it can't find anything. What's wrong?
> >>
> >>I would appreciate any hints, or recommendations for articles about this.
> >>
> >>Best regards.
> >>
> >>Cao Yuzhong
> >>2005-03-15
> >>
> >>
> >
> >
> >
> >
>
>


