nutch-dev mailing list archives

From Jack Tang <him...@gmail.com>
Subject Re: [jira] Commented: (NUTCH-36) Chinese in Nutch
Date Tue, 27 Sep 2005 15:10:01 GMT
Hi Kerang

I think it would be good to write our own CJK bi-gram segmentation.
The third-party CJKTokenizer duplicates a lot of work that
NutchAnalysis already does.
With the rule changed to "| <SIGRAM: (<CJK>)+ >" as in your patch, a new
CJKTokenizer would only need to handle CJK words.
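To make the bi-gram idea concrete, here is a rough sketch of what such a
segmenter might look like once the grammar hands it a run of CJK characters.
It is only an illustration, not code from the patch: the class name
CjkBigramSketch and the Gram holder are made up, and it assumes every
character in the run is a single Java char.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Nutch or Lucene. */
final class CjkBigramSketch {

  /** One gram plus its offsets relative to the start of the matched run. */
  static final class Gram {
    final String text;
    final int start; // inclusive
    final int end;   // exclusive

    Gram(String text, int start, int end) {
      this.text = text;
      this.start = start;
      this.end = end;
    }
  }

  /**
   * Splits a run into overlapping bi-grams. The run is assumed to hold only
   * CJK characters from the Basic Multilingual Plane (one char each), since
   * the grammar rule has already filtered out everything else.
   */
  static List<Gram> bigrams(String run) {
    List<Gram> grams = new ArrayList<Gram>();
    if (run.length() == 1) {
      // A lone character cannot form a pair, so emit it as-is.
      grams.add(new Gram(run, 0, 1));
      return grams;
    }
    for (int i = 0; i + 1 < run.length(); i++) {
      grams.add(new Gram(run.substring(i, i + 2), i, i + 2));
    }
    return grams;
  }

  public static void main(String[] args) {
    // Prints each gram of a four-character run together with its offsets.
    for (Gram g : bigrams("\u4e2d\u6587\u5206\u8bcd")) {
      System.out.println(g.text + " [" + g.start + "," + g.end + ")");
    }
  }
}
```

The offsets are relative to the run, so the token manager would still add
the start column of the match, as the patch does with cjkStartOffset.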

Another idea I have for CJK segmentation is to make CJKTokenizer an
interface whose implementation can be configured in
nutch-default.xml/nutch-site.xml. I think this design would make it easier
to improve CJK segmentation in the future.
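Roughly, the pluggable design could look like the sketch below. Everything
in it is an assumption for discussion: the CjkSegmenter interface, the two
implementations, and the property name "analysis.cjk.segmenter.class" do not
exist in Nutch, and java.util.Properties only stands in for whatever is read
from nutch-default.xml/nutch-site.xml.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical pluggable segmenter; not an existing Nutch or Lucene type. */
interface CjkSegmenter {
  /** Splits a run that is assumed to contain only CJK characters. */
  List<String> segment(String cjkRun);
}

/** Overlapping bi-grams; a single-character run is returned unchanged. */
final class BigramSegmenter implements CjkSegmenter {
  public List<String> segment(String cjkRun) {
    List<String> out = new ArrayList<String>();
    if (cjkRun.length() == 1) {
      out.add(cjkRun);
      return out;
    }
    for (int i = 0; i + 1 < cjkRun.length(); i++) {
      out.add(cjkRun.substring(i, i + 2));
    }
    return out;
  }
}

/** One token per character, like the original single-character SIGRAM rule. */
final class UnigramSegmenter implements CjkSegmenter {
  public List<String> segment(String cjkRun) {
    List<String> out = new ArrayList<String>();
    for (int i = 0; i < cjkRun.length(); i++) {
      out.add(cjkRun.substring(i, i + 1));
    }
    return out;
  }
}

final class CjkSegmenterFactory {
  /** Hypothetical property name; Nutch defines no such key. */
  static final String KEY = "analysis.cjk.segmenter.class";

  /** Instantiates the configured class, falling back to bi-grams. */
  static CjkSegmenter fromConfig(Properties conf) {
    String name = conf.getProperty(KEY, BigramSegmenter.class.getName());
    try {
      return Class.forName(name)
          .asSubclass(CjkSegmenter.class)
          .getDeclaredConstructor()
          .newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot load CJK segmenter " + name, e);
    }
  }
}
```

With something like this, a dictionary-based segmenter could later be
dropped in by changing one property, without touching NutchAnalysis.jj again.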

Comments?

Regards
/Jack

On 9/27/05, Kerang Lv (JIRA) <jira@apache.org> wrote:
>     [ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_12330588 ]
>
> Kerang Lv commented on NUTCH-36:
> --------------------------------
>
> Code like the following can be used to perform third-party CJK word
> segmentation in NutchAnalysis.jj. CJKTokenizer, a bi-gram segmenter, is
> used in this example.
> ================================================================================
> @@ -33,6 +33,7 @@
>  import org.apache.nutch.searcher.Query.Clause;
>
>  import org.apache.lucene.analysis.StopFilter;
> +import org.apache.lucene.analysis.cjk.CJKTokenizer;
>
>  import java.io.*;
>  import java.util.*;
> @@ -81,6 +82,14 @@
>  PARSER_END(NutchAnalysis)
>
>  TOKEN_MGR_DECLS : {
> +  /** use CJKTokenizer to process cjk character */
> +  private CJKTokenizer cjkTokenizer = null;
> +
> +  /** a global cjk token */
> +  private org.apache.lucene.analysis.Token cjkToken = null;
> +
> +  /** start offset of cjk sequence */
> +  private int cjkStartOffset = 0;
>
>    /** Constructs a token manager for the provided Reader. */
>    public NutchAnalysisTokenManager(Reader reader) {
> @@ -106,7 +115,46 @@
>      }
>
>    // chinese, japanese and korean characters
> -| <SIGRAM: <CJK> >
> +| <SIGRAM: (<CJK>)+ >
> +  {
> +    /**
> +     * Use a CJKTokenizer instance (cjkTokenizer) to hold the longest
> +     * matched run of cjk chars, and cjkToken to hold the current token:
> +     * set matchedToken.image from cjkToken.termText();
> +     * set matchedToken.beginColumn from cjkToken.startOffset();
> +     * set matchedToken.endColumn from cjkToken.endOffset();
> +     * back up the last char while the next cjkToken is valid.
> +     */
> +    if(cjkTokenizer == null) {
> +      cjkTokenizer = new CJKTokenizer(new StringReader(image.toString()));
> +      cjkStartOffset = matchedToken.beginColumn;
> +      try {
> +        cjkToken = cjkTokenizer.next();
> +      } catch(IOException ioe) {
> +        cjkToken = null;
> +      }
> +    }
> +
> +    if(cjkToken != null && !cjkToken.termText().equals("")) {
> +      // sometimes the cjkTokenizer returns an empty string; is it a bug?
> +      matchedToken.image = cjkToken.termText();
> +      matchedToken.beginColumn = cjkStartOffset + cjkToken.startOffset();
> +      matchedToken.endColumn = cjkStartOffset + cjkToken.endOffset();
> +      try {
> +        cjkToken = cjkTokenizer.next();
> +      } catch(IOException ioe) {
> +        cjkToken = null;
> +      }
> +      if(cjkToken != null && !cjkToken.termText().equals("")) {
> +        input_stream.backup(1);
> +      }
> +    }
> +
> +    if(cjkToken == null || cjkToken.termText().equals("")) {
> +      cjkTokenizer = null;
> +      cjkStartOffset = 0;
> +    }
> +  }
>
>
> > Chinese in Nutch
> > ----------------
> >
> >          Key: NUTCH-36
> >          URL: http://issues.apache.org/jira/browse/NUTCH-36
> >      Project: Nutch
> >         Type: Improvement
> >   Components: indexer, searcher
> >  Environment: all
> >     Reporter: Jack Tang
> >     Priority: Minor
> >  Attachments: 桌
> >
> > Nutch now supports Chinese in a very simple way: NutchAnalysis segments CJK terms word by word.
> > So if I search for the Chinese term 'FooBar' (two Chinese words: 'Foo' and 'Bar'), the results
> > in the web GUI highlight 'FooBar' as well as 'Foo' and 'Bar', whereas we expect Nutch to
> > highlight only 'FooBar'.
>
> --
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the administrators:
>    http://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see:
>    http://www.atlassian.com/software/jira
>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
