nutch-dev mailing list archives

From "Jason Tang" <jason.t...@commcentral.com>
Subject Re: Re: [Nutch-dev] RE: A problem about Chinese word segment
Date Thu, 17 Mar 2005 07:13:37 GMT
Weird! Nutch supports searching for Chinese characters.

Can you print your query string in search.jsp? 
NOTE: the page should be encoded in UTF-8.
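
To show why the encoding matters, here is a minimal sketch, assuming the servlet container decodes the request as ISO-8859-1 while the browser sent UTF-8 bytes. The class and method names are made up for illustration and are not part of Nutch or search.jsp. Printing the code points of the query string makes this kind of mismatch easy to spot.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Nutch code: simulates a UTF-8 query that was
// decoded with the wrong charset, and shows how to detect and undo it.
final class QueryEncodingCheck {

    // What the query looks like if UTF-8 bytes are decoded as ISO-8859-1.
    static String misdecode(String query) {
        byte[] utf8 = query.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverse the wrong decoding: recover the bytes, decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Render a string as hex code points so it can be printed safely.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String query = "\u4e2d\u6587";            // "中文"
        String garbled = misdecode(query);
        System.out.println(codePoints(query));    // two Han code points
        System.out.println(codePoints(garbled));  // six Latin-1 code points
        System.out.println(repair(garbled).equals(query));
    }
}
```

If the query printed in search.jsp shows several code points below U+0100 for each Chinese character, the request was probably decoded with the wrong charset before it ever reached the searcher.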


/Jack
  
======= At 2005-03-17, 13:49:00 you wrote: =======

>I have added Chinese stopwords to String[] STOP_WORDS in NutchAnalysis.jj.
>My problem is that Nutch returns nothing when I use any Chinese keywords,
>even though I can find these Chinese keywords in the index files (using
>Luke).
>
>
>>From: "Jason Tang" <jason.tang@commcentral.com>
>>Reply-To: nutch-dev@incubator.apache.org
>>To: "dev@nutch.org" <dev@nutch.org>
>>Subject: Re: [Nutch-dev] RE: A problem about Chinese word segment
>>Date: Thu, 17 Mar 2005 11:08:15 +0800
>>
>>Hi cao
>>
>>I think the character "的" is a stopword in Chinese.
>>I think NutchAnalysis.jj should load a different stopwords file depending
>>on the language.
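
The per-language stopword idea could be sketched roughly as follows. This is only an illustration: the class name, the language codes, and the tiny word lists are assumptions, and a real version would load each list from its own file rather than hard-code it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Nutch code: pick a stopword set by language.
final class StopWordsByLanguage {

    // Toy lists for illustration only; real lists would come from files.
    private static final Map<String, Set<String>> STOP_WORDS = new HashMap<>();
    static {
        STOP_WORDS.put("en", new HashSet<>(Arrays.asList("a", "the", "of")));
        STOP_WORDS.put("zh", new HashSet<>(Arrays.asList("的", "了", "是")));
    }

    // Drop the tokens that are stopwords for the given language.
    // Unknown languages fall back to an empty stopword set.
    static List<String> removeStopWords(String lang, List<String> tokens) {
        Set<String> stops =
            STOP_WORDS.getOrDefault(lang, Collections.<String>emptySet());
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stops.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

With these toy lists, removeStopWords("zh", ...) applied to the tokens "中文", "的", "分词" keeps "中文" and "分词", while the same call with "en" leaves all three in place.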
>>
>>/Jack
>>
>>
>>
>>======= At 2005-03-17, 10:27:40 you wrote: =======
>>
>> >No answer for this?
>> >Any tips are appreciated.
>> >
>> >>From: "cao yuzhong" <caoyuzhong@hotmail.com>
>> >>Reply-To: nutch-dev@incubator.apache.org
>> >>To: nutch-dev@incubator.apache.org
>> >>CC: caoyuzhong@hotmail.com
>> >>Subject: A problem about Chinese word segment
>> >>Date: Tue, 15 Mar 2005 05:16:30 +0000
>> >>
>> >>Hi all,
>> >>
>> >>Currently, Nutch-0.6 simply treats each Chinese character as a single token.
>> >>I have tried to make it treat a group of related Chinese
>> >>characters (a Chinese word) as one token.
>> >>So I need to modify the Analyzer.
>> >>
>> >>First, I modified the file NutchAnalysis.jj in
>> >>src/java/net/nutch/analysis.
>> >>I changed " <SIGRAM: <CJK> > " to " <SIGRAM: (<CJK>)+ > " so that
>> >>Nutch can treat one or more Chinese characters as a token. Then I used
>> >>JavaCC to generate the code.
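
The effect of that grammar change can be approximated outside JavaCC with a regular expression. This is a sketch under assumptions: it uses \p{IsHan} as a stand-in for the grammar's <CJK> class, which may cover a different character range, and the class name is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: regex approximation of the two SIGRAM rules.
final class CjkTokenDemo {

    // runs == false mimics <SIGRAM: <CJK> >      (one token per character)
    // runs == true  mimics <SIGRAM: (<CJK>)+ >   (one token per run)
    static List<String> tokens(String text, boolean runs) {
        Pattern p = Pattern.compile(runs ? "\\p{IsHan}+" : "\\p{IsHan}");
        Matcher m = p.matcher(text);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("中文 搜索", false)); // per-character tokens
        System.out.println(tokens("中文 搜索", true));  // per-run tokens
        System.out.println(tokens("中文搜索", true));   // one long token
    }
}
```

Note the last case: under the "+" rule an unsegmented string becomes one long token. If query strings go through the same rule without being segmented first, a query such as "中文搜索" would not match indexed terms like "中文" and "搜索". That may be worth checking on the search side.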
>> >>
>> >>Second, I have to segment Chinese text into Chinese words (inserting a
>> >>space between two Chinese words) before indexing so that Nutch can
>> >>recognize them. I have written a class
>> >>to do this, and I have modified the function refill() in
>> >>FastCharStream.java:
>> >>
>> >>below the line:
>> >>int charsRead = input.read(buffer, newPosition,
>> >>                           buffer.length - newPosition);
>> >>
>> >>I added:
>> >>//----
>> >>if (charsRead != -1) {
>> >>  // the characters just read into the buffer
>> >>  String str = new String(buffer, newPosition, charsRead);
>> >>
>> >>  // do Chinese word segmentation, for example:
>> >>  // if str is "中文搜索引擎的分词问题"
>> >>  // then str2 will be "中文 搜索引擎 的 分词 问题"
>> >>  String str2 = Spliter.segSentence(str);
>> >>
>> >>  // expand the buffer until the segmented text fits
>> >>  while (str2.length() > buffer.length - newPosition) {
>> >>    char[] newBuffer = new char[buffer.length * 2];
>> >>    System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
>> >>    buffer = newBuffer;
>> >>  }
>> >>
>> >>  // overwrite the raw text with the segmented text
>> >>  for (int i = 0; i < str2.length(); i++) {
>> >>    buffer[newPosition + i] = str2.charAt(i);
>> >>  }
>> >>  charsRead = str2.length();
>> >>}
>> >>//----
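
For readers who want to try the segment-then-copy step in isolation, here is a self-contained sketch. The segmenter is a stand-in: the Spliter class was not posted, so this uses a toy dictionary with forward maximum matching, which is an assumption and not necessarily how the original class works. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not the poster's code: segment text with a toy
// dictionary and copy the result into a growable char buffer.
final class SegmentingRefillSketch {

    // Toy dictionary for illustration only.
    private static final Set<String> WORDS = new HashSet<>(
        Arrays.asList("中文", "搜索引擎", "的", "分词", "问题"));
    private static final int MAX_WORD_LEN = 4;

    // Forward maximum matching: at each position take the longest
    // dictionary word, falling back to a single character.
    static String segSentence(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int end = Math.min(s.length(), i + MAX_WORD_LEN);
            while (end > i + 1 && !WORDS.contains(s.substring(i, end))) {
                end--;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(s, i, end);
            i = end;
        }
        return out.toString();
    }

    // Write the segmented text at pos, growing the buffer if needed.
    // Returns the buffer to use from now on (possibly a new array).
    static char[] fill(char[] buffer, int pos, String segmented) {
        int needed = pos + segmented.length();
        if (needed > buffer.length) {
            buffer = Arrays.copyOf(buffer, Math.max(needed, buffer.length * 2));
        }
        segmented.getChars(0, segmented.length(), buffer, pos);
        return buffer;
    }
}
```

One caveat with doing this inside refill(): each call only sees the characters read in that call, so a word that straddles two reads could be split at the boundary. The sketch does not handle that case.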
>> >>
>> >>Third, I compiled and ran CrawlTool.
>> >>Then I used lukeall-0.5 to view the index directory.
>> >>It looks OK: Chinese words, not single Chinese characters, have been
>> >>indexed as terms.
>> >>
>> >>But when I deploy Nutch in Tomcat 5.5 and run a search test,
>> >>it can't find anything. What's wrong?
>> >>
>> >>I need your hints, or perhaps you can recommend some articles about this.
>> >>
>> >>Best regards.
>> >>
>> >>Cao Yuzhong
>> >>2005-03-15
>> >>
>> >>
>> >
>> >
>> >
>> >
>> >-------------------------------------------------------
>> >SF email is sponsored by - The IT Product Guide
>> >Read honest & candid reviews on hundreds of IT Products from real users.
>> >Discover which products truly live up to the hype. Start reading now.
>> >http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
>> >_______________________________________________
>> >Nutch-developers mailing list
>> >Nutch-developers@lists.sourceforge.net
>> >https://lists.sourceforge.net/lists/listinfo/nutch-developers
>>
>>= = = = = = = = = = = = = = = = = = = =
>>

= = = = = = = = = = = = = = = = = = = =

