nutch-dev mailing list archives

From Jack Tang <>
Subject NutchAnalysis and CJK
Date Fri, 15 Jul 2005 02:49:01 GMT
Hi All

I have spent a long time thinking about how to embed an improved
CJK analysis into NutchAnalysis. So far I have nothing but failed
attempts, which I am sharing here in case one of you can hack it (well,
I am not going to give up either).

I have written several Chinese word segmenters. Some are dictionary
based, such as Forward Maximum Matching (FMM) and Backward Maximum
Matching (BMM), and some segment automatically, say by bi-gram. They work
fine in a pure Chinese environment (not a mixture of Chinese and other
languages).
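
For readers who have not met the two dictionary-based methods named above, here is a minimal sketch of forward and backward maximum matching in plain Java. It is not the segmenter described in this message: the class name MaxMatchSegmenter, the method names, and the maxLen parameter are all made up for illustration, and the dictionary is just a Set of strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Nutch or Lucene.
class MaxMatchSegmenter {

  // Forward maximum matching: scan left to right and, at each position,
  // take the longest dictionary word of at most maxLen characters.
  static List<String> forward(String text, Set<String> dict, int maxLen) {
    List<String> words = new ArrayList<>();
    int start = 0;
    while (start < text.length()) {
      int end = Math.min(text.length(), start + maxLen);
      // Shrink the candidate until it is a dictionary word or one character.
      while (end > start + 1 && !dict.contains(text.substring(start, end))) {
        end--;
      }
      words.add(text.substring(start, end));
      start = end;
    }
    return words;
  }

  // Backward maximum matching: the same idea, scanning right to left.
  static List<String> backward(String text, Set<String> dict, int maxLen) {
    List<String> words = new ArrayList<>();
    int end = text.length();
    while (end > 0) {
      int start = Math.max(0, end - maxLen);
      // Shrink the candidate from the left until it matches or is one character.
      while (start < end - 1 && !dict.contains(text.substring(start, end))) {
        start++;
      }
      words.add(text.substring(start, end));
      end = start;
    }
    Collections.reverse(words); // words were collected last to first
    return words;
  }
}
```

With letters standing in for characters, the dictionary {ab, abc, cd}, the input abcd and maxLen 3, forward() gives [abc, d] while backward() gives [ab, cd], which shows how the two directions can disagree on the same text.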

Why do I only aim at a pure Chinese environment? In NutchAnalysis.jj I changed

  // chinese, japanese and korean characters
| <SIGRAM: <CJK> >

to

  // chinese, japanese and korean characters
| <SIGRAM: (<CJK>)+ >

so SIGRAM only contains CJK words.

Well, I am not very familiar with JavaCC, so one big puzzle has me stuck.
As you know:

  // basic word -- lowercase it
  { matchedToken.image = matchedToken.image.toLowerCase(); }

This statement means that if the input matches the "WORD" rule, the
wrapped object matchedToken holds the target word. *ONE* word is
extracted per match.

So the term() function is simple.

/** Parse a single term. */
String term() :
{
  Token token;
}
{
  // I don't think it is reasonable to put "token=<SIGRAM>" here.
  ( token=<WORD> | token=<ACRONYM> )

  { return token.image; }
}

For CJK it is quite different. We have to extract *MANY* words per match.

  // chinese, japanese and korean characters
| <SIGRAM: (<CJK>)+ >
// parsing (<CJK>)+ will generate many words (tokens) here!

My approach is to construct one TokenList to hold these tokens.
The pseudocode looks like this:

  // chinese, japanese and korean characters
| <SIGRAM: (<CJK>)+ >
  { for (int i = 0; i + 1 < image.length(); i++) {
      Token token = new Token();  // holds one bi-gram
      token.image = image.substring(i, i + 2);
      tokenList.add(token);
  } }
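
To make the bi-gram step concrete outside of JavaCC, here is a small standalone Java sketch. The class BigramSplitter and its bigrams() method are hypothetical names, not anything in Nutch, and emitting a lone character as a single-character token is my own assumption about what should happen to a run of length one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Nutch.
class BigramSplitter {

  // Split one matched run of CJK characters into overlapping bi-grams.
  // Works on UTF-16 chars, so characters outside the BMP are not handled.
  static List<String> bigrams(CharSequence run) {
    List<String> grams = new ArrayList<>();
    if (run.length() == 1) {
      // Assumption: a single character is emitted as is.
      grams.add(run.toString());
      return grams;
    }
    for (int i = 0; i + 1 < run.length(); i++) {
      grams.add(run.subSequence(i, i + 2).toString());
    }
    return grams;
  }
}
```

Calling BigramSplitter.bigrams("ABC"), with each letter standing for one character, returns [AB, BC], i.e. one matched run produces two tokens.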

Accordingly, the term() function should return an ArrayList.

/** .... **/
ArrayList term() :
{
  Token token;
}
{
  ( token=<WORD> | token=<ACRONYM> | token=<SIGRAM> )

  { return tokenList; }
}


After these modifications, running NutchAnalysis.class gives an odd result.
Say I input some Chinese characters: C1C2C3.
The result will be: "C1C2 C2C3" (NOTICE the quotation marks).

Am I going in the wrong direction? Or would someone share any thoughts on
NutchAnalysis.jj? Thanks.


Keep Discovering ... ...
