nutch-dev mailing list archives

From Jack Tang <him...@gmail.com>
Subject NutchAnalysis and CJK
Date Fri, 15 Jul 2005 02:49:01 GMT
Hi All

I have spent a long time thinking about how to embed an improved CJK
analysis into NutchAnalysis. So far I have nothing but failed attempts,
which I am sharing here in case one of you can crack the problem (I am
not going to give up, though).

I have written several Chinese word segmenters. Some are dictionary
based, such as Forward Maximum Matching (FMM) and Backward Maximum
Matching (BMM), and some segment automatically, for example by bi-gram.
They work fine on pure Chinese text (not on a mixture of Chinese and
other languages).
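
For readers unfamiliar with FMM, here is a minimal sketch of the idea. It
is not the segmenter described above: the class name, the toy dictionary
and the input are all made up for illustration, and each Latin letter
stands in for one Chinese character.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Forward Maximum Matching over a toy dictionary (hypothetical class and data). */
class ForwardMaxMatch {
    private final Set<String> dictionary;
    private final int maxWordLength;

    ForwardMaxMatch(Set<String> dictionary) {
        this.dictionary = dictionary;
        int max = 1;
        for (String word : dictionary) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    /** Scans left to right, taking the longest dictionary word at each position. */
    List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            words.add(text.substring(pos, end));
            pos = end;
        }
        return words;
    }

    public static void main(String[] args) {
        // Each letter stands in for one Chinese character.
        Set<String> dict = new HashSet<>(Arrays.asList("AB", "ABC", "CD", "EF"));
        System.out.println(new ForwardMaxMatch(dict).segment("ABCDEF")); // [ABC, D, EF]
    }
}
```

The greedy left-to-right choice takes "ABC" first and leaves "D" stranded,
even though "AB" + "CD" + "EF" would cover the input with dictionary words
only. BMM scans from the right instead and can resolve such cases
differently.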

Why do I target only pure Chinese text? Because of this change in
NutchAnalysis.jj:

<orig>

  // chinese, japanese and korean characters
| <SIGRAM: <CJK> >

</orig>

<modified>

  // chinese, japanese and korean characters
| <SIGRAM: (<CJK>)+ >

</modified>

With this change, a SIGRAM token contains only a run of CJK characters.

I am not very familiar with JavaCC, so one big puzzle has me stuck.
As you know:

  // basic word -- lowercase it
<WORD: ((<LETTER>|<DIGIT>|<WORD_PUNCT>)+ | <IRREGULAR_WORD>)>
  { matchedToken.image = matchedToken.image.toLowerCase(); }
  
This rule means that when the input matches the "WORD" rule, the
matchedToken object holds the matched word. Only *ONE* word is
extracted per match.

So the term() function is simple:

/** Parse a single term. */
String term() :
{
  Token token;
}
{
  // I don't think it is reasonable to put "token=<SIGRAM>" here.
  ( token=<WORD> | token=<ACRONYM> )

  { return token.image; }
}

For CJK it is quite different. We have to extract *MANY* words from one match.

  // chinese, japanese and korean characters
| <SIGRAM: (<CJK>)+ >
{
  // parsing (<CJK>)+ generates many words (tokens) here!
}
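
One possible way to reconcile "many words per match" with a consumer that
expects one word per call is to buffer the extra words in a queue. The
sketch below only illustrates that idea in plain Java; BufferedTermSource,
push() and next() are hypothetical names and are not part of Nutch or
JavaCC.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper: accepts several terms per match, hands out one per call. */
class BufferedTermSource {
    private final Deque<String> pending = new ArrayDeque<>();

    /** Queues every term produced by a single match, in order. */
    void push(List<String> termsFromOneMatch) {
        pending.addAll(termsFromOneMatch);
    }

    /** Returns the next buffered term, or null when the buffer is empty. */
    String next() {
        return pending.poll();
    }

    public static void main(String[] args) {
        BufferedTermSource source = new BufferedTermSource();
        source.push(Arrays.asList("AB", "BC")); // one CJK match, two terms
        source.push(Arrays.asList("nutch"));    // one WORD match, one term
        for (String term = source.next(); term != null; term = source.next()) {
            System.out.println(term);
        }
    }
}
```

With this shape the caller still sees a flat sequence of single terms,
regardless of how many terms each match produced.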

My approach is to construct a TokenList to hold these tokens.
The pseudocode looks like this:

  // chinese, japanese and korean characters
| <SIGRAM: (<CJK>)+ >
{
  for (int i = 0; i < image.length(); ...) {
    Token token = ...;  // extract the next bi-gram from image
    tokenList.add(token);
  }
}
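
The bi-gram step of the pseudocode can be sketched in plain Java as
follows. This is an illustration only, not the lexical action itself:
BigramSplitter and bigrams() are made-up names, it works on a String
rather than on JavaCC's image buffer, and it assumes every character
occupies a single Java char. Letters stand in for CJK characters.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits a CJK run into overlapping bi-grams. */
class BigramSplitter {
    /** Returns overlapping two-character terms; a one-character run is kept as is. */
    static List<String> bigrams(String run) {
        List<String> terms = new ArrayList<>();
        if (run.length() == 1) {
            terms.add(run);
            return terms;
        }
        // Slide a two-character window one position at a time.
        for (int i = 0; i + 1 < run.length(); i++) {
            terms.add(run.substring(i, i + 2));
        }
        return terms;
    }

    public static void main(String[] args) {
        // "ABC" plays the role of C1C2C3.
        System.out.println(bigrams("ABC")); // [AB, BC]
    }
}
```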

Accordingly, the term() function should return an ArrayList:

/** .... **/
ArrayList term():
{
Token token;
}
{
(token=<WORD> | token=<ACRONYM> | token=<SIGRAM>)
  {
    return tokenList;
  }

}

After these modifications, running NutchAnalysis.class gives an odd result.
Say I input some Chinese characters: C1C2C3.
The result is: "C1C2 C2C3" (NOTICE the quotation marks).

Am I heading in the wrong direction? Or can someone share any thoughts
on NutchAnalysis.jj? Thanks.



Regards
/Jack

-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars
