nutch-dev mailing list archives

From Transbuerg Tian <accesine.j...@gmail.com>
Subject Re: NutchAnalysis and CJK
Date Fri, 15 Jul 2005 04:34:52 GMT
Hi Jack Tang,

I am in the same situation as you. Could you share your complete
NutchAnalysis.jj file here? I am using Lucene, not Nutch.

Good luck.

http://blog.csdn.net/accesine960/archive/2005/07/13/424306.aspx


2005/7/15, Jack Tang <himars@gmail.com>:
> 
> Hi All
> 
> I have spent a long time thinking about how to embed an improved
> CJKAnalysis into NutchAnalysis. So far I have nothing but some failed
> attempts, which I will share with you; maybe you can hack it (well, I
> am not going to give up).
> 
> I have written several Chinese word segmenters. Some are dictionary
> based, such as Forward Maximum Matching (FMM) and Backward Maximum
> Matching (BMM), and some are automatic, such as bi-gram. They work
> fine in a pure-Chinese environment (not a mixture of Chinese and other
> languages).
> 
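For readers who have not seen dictionary-based segmentation, here is a minimal Java sketch of Forward Maximum Matching as it is usually described. It is not Jack's code: the class name FmmSketch, the method segment, the maxLen parameter, and the toy dictionary are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of Forward Maximum Matching (FMM).
class FmmSketch {
    /**
     * Scans left to right. At each position it takes the longest dictionary
     * word of at most maxLen characters; if none matches, it falls back to a
     * single character so the scan always advances.
     */
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int window = Math.max(1, maxLen); // guard against a non-positive window
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + window);
            // Shrink the candidate until it is a dictionary word or one character long.
            while (end > pos + 1 && !dict.contains(text.substring(pos, end))) {
                end--;
            }
            words.add(text.substring(pos, end));
            pos = end;
        }
        return words;
    }

    public static void main(String[] args) {
        // Toy dictionary; Latin letters stand in for CJK characters.
        Set<String> dict = new HashSet<>(Arrays.asList("ab", "abc", "d"));
        System.out.println(segment("abcd", dict, 3));
    }
}
```

Backward Maximum Matching is the same idea scanning from the end of the text toward the start.
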
> Why do I only target a pure-Chinese environment? Because of this rule
> in NutchAnalysis.jj:
> 
> <orig>
> 
> // chinese, japanese and korean characters
> | <SIGRAM: <CJK> >
> 
> </orig>
> 
> <modified>
> 
> // chinese, japanese and korean characters
> | <SIGRAM: (<CJK>)+ >
> 
> </modified>
> 
> With this change, a single SIGRAM token matches a whole run of
> consecutive CJK characters and nothing else.
> 
> Well, I am not very familiar with JavaCC, so one big puzzle has me
> stuck. As you know:
> 
> // basic word -- lowercase it
> <WORD: ((<LETTER>|<DIGIT>|<WORD_PUNCT>)+ | <IRREGULAR_WORD>)>
> { matchedToken.image = matchedToken.image.toLowerCase(); }
> 
> This rule means that when the input matches the WORD rule, the
> matchedToken object holds the matched word. Exactly *ONE* word is
> extracted per match.
> 
> So the term() function is simple:
> 
> /** Parse a single term. */
> String term() :
> {
>   Token token;
> }
> {
>   // I don't think it is reasonable to put "token=<SIGRAM>" here.
>   ( token=<WORD> | token=<ACRONYM> )
> 
>   { return token.image; }
> }
> 
> For CJK it is quite different: we have to extract *MANY* words from
> one match.
> 
> // chinese, japanese and korean characters
> | <SIGRAM: (<CJK>)+ >
> {
>   // parsing <CJK>+ will generate many words (tokens) here!
> }
> 
> My approach is to build a TokenList to hold these tokens.
> The pseudocode looks like this:
> 
> // chinese, japanese and korean characters
> | <SIGRAM: (<CJK>)+ >
> {
>   // pseudocode: walk the matched CJK run and collect bi-gram tokens
>   for (int i = 0; i < image.length(); ...) {
>     Token token = /* extract the next bi-gram */;
>     tokenList.add(token);
>   }
> }
> 
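The bi-gram step in the pseudocode above can be sketched as standalone Java, outside the JavaCC grammar. This is an illustrative assumption, not the code from the message: BigramSketch and bigrams are made-up names, tokens are plain strings instead of Token objects, and characters outside the Basic Multilingual Plane (surrogate pairs) are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of overlapping bi-gram extraction from one CJK run.
class BigramSketch {
    /** Splits a run of characters into overlapping two-character tokens. */
    static List<String> bigrams(String run) {
        List<String> tokens = new ArrayList<>();
        if (run.length() == 1) {
            // Design choice for this sketch: a lone character becomes its own token.
            tokens.add(run);
            return tokens;
        }
        // Each position pairs the current character with its successor.
        for (int i = 0; i + 1 < run.length(); i++) {
            tokens.add(run.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Four CJK characters written as Unicode escapes to avoid encoding issues.
        System.out.println(bigrams("\u4e2d\u6587\u5206\u8bcd"));
    }
}
```

For a three-character run C1C2C3 this yields the two tokens C1C2 and C2C3, the same pair that appears in the result reported later in the message.
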
> Accordingly, the term() function should return an ArrayList:
> 
> /** .... **/
> ArrayList term() :
> {
>   Token token;
> }
> {
>   ( token=<WORD> | token=<ACRONYM> | token=<SIGRAM> )
>   {
>     return tokenList;
>   }
> }
> 
> After these modifications, running NutchAnalysis.class gives an odd
> result. Say I input some Chinese characters: C1C2C3.
> The result will be: "C1C2 C2C3" (NOTICE the quotation marks).
> 
> Am I heading in the wrong direction? Could someone share any thoughts
> on NutchAnalysis.jj? Thanks.
> 
> 
> 
> Regards
> /Jack
> 
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>
