nutch-dev mailing list archives

From "cao yuzhong" <caoyuzh...@hotmail.com>
Subject Re: Re: [Nutch-dev] RE: A problem about Chinese word segment
Date Thu, 17 Mar 2005 08:44:16 GMT
When I use "北京" as the keyword, the query string is:
http://127.0.0.1:8080/search.jsp?query=%E5%8C%97%E4%BA%AC

It returns 0 results.
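
As a quick check of the encoding, here is a minimal sketch (plain JDK, not Nutch code; the class name QueryDecodeCheck is made up for illustration). It shows that the query value above decodes to "北京" only when the percent-encoded bytes are read as UTF-8. The expected string is written with Unicode escapes (\u5317\u4EAC) so the file compiles regardless of source encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper, not part of Nutch: decodes a percent-encoded
// query parameter using an explicitly named charset.
final class QueryDecodeCheck {

    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String encoded = "%E5%8C%97%E4%BA%AC"; // query value from the URL above

        String utf8 = decode(encoded, "UTF-8");
        String latin1 = decode(encoded, "ISO-8859-1");

        // Six bytes: two characters as UTF-8, but six characters when each
        // byte is taken as a separate ISO-8859-1 character.
        System.out.println("UTF-8 length:      " + utf8.length());
        System.out.println("ISO-8859-1 length: " + latin1.length());
        System.out.println("UTF-8 matches:     " + utf8.equals("\u5317\u4EAC"));
    }
}
```

If the servlet container decodes the parameter with a charset other than UTF-8, the term looked up in the index will not be "北京". Whether that is what happens in search.jsp here is not confirmed.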

But when I use the NutchBean class to search for "北京", it returns 23 hits.
However, something may still be wrong, because there are many blank lines in the
output.
The output is like this:

Total hits: 23

050317 163326 10 found resource common-terms.utf8 at 
file:/D:/nutch-0.6/conf/common-terms.utf8

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 0 20050317162844/6

 1 20050317162844/21

 2 20050317162844/3

 3 20050317162844/22

 4 20050317162844/c

 5 20050317162844/10

 6 20050317162844/11

 7 20050317162844/19

 8 20050317162844/25

 9 20050317162844/d


>From: "Jason Tang" <jason.tang@commcentral.com>
>Reply-To: nutch-dev@incubator.apache.org
>To: "dev@nutch.org" <dev@nutch.org>
>Subject: Re: Re: [Nutch-dev] RE: A problem about Chinese word segment
>Date: Thu, 17 Mar 2005 15:13:37 +0800
>
>Weird! Nutch supports searching for Chinese characters.
>
>Can you print your query string in search.jsp?
>NOTE: the page should be encoded in UTF-8.
>
>
>/Jack
>
>======= At 2005-03-17, 13:49:00 you wrote: =======
>
> >I have added Chinese stopwords to String[] STOP_WORDS in NutchAnalysis.jj.
> >My problem is that Nutch returns nothing when I use any Chinese keyword,
> >even though I can find these Chinese keywords in the index files (using
> >Luke).
> >
> >
> >>From: "Jason Tang" <jason.tang@commcentral.com>
> >>Reply-To: nutch-dev@incubator.apache.org
> >>To: "dev@nutch.org" <dev@nutch.org>
> >>Subject: Re: [Nutch-dev] RE: A problem about Chinese word segment
> >>Date: Thu, 17 Mar 2005 11:08:15 +0800
> >>
> >>Hi cao
> >>
> >>I think the character "的" is a stopword in Chinese.
> >>I think NutchAnalysis.jj should load a different stopwords file for each
> >>language.
> >>
> >>/Jack
> >>
> >>
> >>
> >>======= At 2005-03-17, 10:27:40 you wrote: =======
> >>
> >> >No answer for this?
> >> >Any tips are appreciated.
> >> >
> >> >>From: "cao yuzhong" <caoyuzhong@hotmail.com>
> >> >>Reply-To: nutch-dev@incubator.apache.org
> >> >>To: nutch-dev@incubator.apache.org
> >> >>CC: caoyuzhong@hotmail.com
> >> >>Subject: A problem about Chinese word segment
> >> >>Date: Tue, 15 Mar 2005 05:16:30 +0000
> >> >>
> >> >>Hi all,
> >> >>
> >> >>Currently, Nutch-0.6 simply treats each Chinese character as a single token.
> >> >>I have attempted to make it treat a group of related Chinese
> >> >>characters (a Chinese word) as one token,
> >> >>so I need to modify the Analyzer.
> >> >>
> >> >>First, I modified the file NutchAnalysis.jj in
> >> >>src/java/net/nutch/analysis.
> >> >>I changed " <SIGRAM: <CJK> > " to " <SIGRAM: (<CJK>)+ > " so that
> >> >>Nutch can treat one or more Chinese characters as a token. Then I used
> >> >>JavaCC to generate the code.
> >> >>
> >> >>Second, I have to segment Chinese text into Chinese words (inserting a
> >> >>space between two Chinese words) before indexing so that Nutch can
> >> >>recognize them. I have written a class
> >> >>to do this, and I have modified the function refill() in
> >> >>FastCharStream.java:
> >> >>
> >> >>Below the line:
> >> >>int charsRead = input.read(buffer, newPosition,
> >> >>                           buffer.length - newPosition);
> >> >>
> >> >>I added:
> >> >>//----
> >> >>if (charsRead != -1) {
> >> >>    // Text that was just read into the buffer.
> >> >>    String str1 = new String(buffer, newPosition, charsRead);
> >> >>
> >> >>    // Do Chinese word segmentation, for example:
> >> >>    // if str1 = "中文搜索引擎的分词问题"
> >> >>    // then str2 will be "中文 搜索引擎 的 分词 问题"
> >> >>    String str2 = Spliter.segSentence(str1);
> >> >>
> >> >>    // Expand the buffer until the segmented text fits.
> >> >>    while (str2.length() > buffer.length - newPosition) {
> >> >>        char[] newBuffer = new char[buffer.length * 2];
> >> >>        System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
> >> >>        buffer = newBuffer;
> >> >>    }
> >> >>
> >> >>    // Overwrite the raw text with the segmented text.
> >> >>    str2.getChars(0, str2.length(), buffer, newPosition);
> >> >>    charsRead = str2.length();
> >> >>}
> >> >>//----
> >> >>
> >> >>Third, I compiled and ran CrawlTool.
> >> >>Then I used lukeall-0.5 to view the index directory.
> >> >>It looks OK: Chinese words, not single Chinese characters, have been
> >> >>indexed as terms.
> >> >>
> >> >>But when I deploy Nutch in Tomcat 5.5 and run a search test,
> >> >>it can't find anything. What's wrong?
> >> >>
> >> >>I would appreciate any hints, or recommendations of articles about this.
> >> >>
> >> >>Best regards.
> >> >>
> >> >>Cao Yuzhong
> >> >>2005-03-15
> >> >>
> >> >>
> >> >
> >> >
> >> >
> >> >
> >> >-------------------------------------------------------
> >> >SF email is sponsored by - The IT Product Guide
> >> >Read honest & candid reviews on hundreds of IT Products from real users.
> >> >Discover which products truly live up to the hype. Start reading now.
> >> >http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> >> >_______________________________________________
> >> >Nutch-developers mailing list
> >> >Nutch-developers@lists.sourceforge.net
> >> >https://lists.sourceforge.net/lists/listinfo/nutch-developers
> >>
> >>= = = = = = = = = = = = = = = = = = = =
> >>
> >
> >
> >
> >
>
>= = = = = = = = = = = = = = = = = = = =
>
>
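
The segmentation step described in the original message (inserting a space between Chinese words before the text reaches the tokenizer) can be sketched as follows. The Spliter class is not shown in the thread, so this stand-in uses forward maximum matching against a tiny hand-made dictionary; the class name SimpleSegmenter, its dictionary, and the choice of algorithm are assumptions for illustration only. The string literals are written as Unicode escapes so the file compiles regardless of source encoding: the input spells "中文搜索引擎的分词问题" and the dictionary holds "中文", "搜索引擎", "分词" and "问题".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the Spliter class mentioned in the thread.
// Segments text by forward maximum matching against an in-memory dictionary.
final class SimpleSegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength;

    SimpleSegmenter(Set<String> dictionary) {
        this.dictionary = dictionary;
        int max = 1;
        for (String word : dictionary) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    // Returns the input with a single space between adjacent words.
    String segSentence(String text) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < text.length()) {
            // Try the longest candidate first; fall back to one character.
            int end = Math.min(text.length(), pos + maxWordLength);
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(text, pos, end);
            pos = end;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Dictionary words: "Chinese", "search engine", "word segmentation", "problem".
        Set<String> words = new HashSet<>(Arrays.asList(
                "\u4E2D\u6587",
                "\u641C\u7D22\u5F15\u64CE",
                "\u5206\u8BCD",
                "\u95EE\u9898"));
        SimpleSegmenter segmenter = new SimpleSegmenter(words);

        // The example sentence from the original message.
        String sentence = "\u4E2D\u6587\u641C\u7D22\u5F15\u64CE\u7684\u5206\u8BCD\u95EE\u9898";
        System.out.println(segmenter.segSentence(sentence));
    }
}
```

With this dictionary the sentence is split into "中文 搜索引擎 的 分词 问题", the same result as the example in the original message. Any character the dictionary does not cover, such as "的" here, comes out as a single-character token.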


