lucene-java-user mailing list archives

From Dawid Weiss <dawid.we...@gmail.com>
Subject Re: Japanese analyzer
Date Fri, 18 Jan 2013 13:51:41 GMT
Jerome,

Some of the tokens are probably removed because their part-of-speech
tags are listed in the stoptags file. That's my guess at least -- you
can always copy/paste the Japanese analyzer and change the token stream
components:

  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer tokenizer = new JapaneseTokenizer(reader, userDict, true, mode);
    TokenStream stream = new JapaneseBaseFormFilter(tokenizer);
    // This is the thing I was talking about.
    stream = new JapanesePartOfSpeechStopFilter(true, stream, stoptags);
    stream = new CJKWidthFilter(stream);
    stream = new StopFilter(matchVersion, stream, stopwords);
    stream = new JapaneseKatakanaStemFilter(stream);
    stream = new LowerCaseFilter(matchVersion, stream);
    return new TokenStreamComponents(tokenizer, stream);
  }
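
To make the guess above concrete, here is a minimal, self-contained
sketch of what a part-of-speech stop filter does. It is not the Lucene
API: the class PosStopFilterSketch, the romanized sample tokens and the
English tag names are all hypothetical, hand-assigned stand-ins, not
real tokenizer output or real stoptags entries. It only shows how a
token can disappear because of its tag even though the word itself is
not in the stop word file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-in for a part-of-speech stop filter; NOT the Lucene API.
final class PosStopFilterSketch {

    // A term paired with a hand-assigned part-of-speech tag (illustrative only).
    static final class TaggedToken {
        final String term;
        final String pos;

        TaggedToken(String term, String pos) {
            this.term = term;
            this.pos = pos;
        }
    }

    // Returns the terms whose tag is NOT listed in stopTags, in input order.
    static List<String> filter(List<TaggedToken> tokens, Set<String> stopTags) {
        List<String> kept = new ArrayList<String>();
        for (TaggedToken token : tokens) {
            if (!stopTags.contains(token.pos)) {
                kept.add(token.term);
            }
        }
        return kept;
    }

    // Hypothetical sample: a romanized sentence with made-up tags.
    static List<TaggedToken> sample() {
        return Arrays.asList(
                new TaggedToken("watashi", "pronoun"),
                new TaggedToken("wa", "particle"),
                new TaggedToken("nihonjin", "noun"),
                new TaggedToken("desu", "auxiliary-verb"));
    }

    public static void main(String[] args) {
        Set<String> stopTags =
                new HashSet<String>(Arrays.asList("particle", "auxiliary-verb"));
        // Tokens tagged as particle or auxiliary-verb are dropped.
        System.out.println(filter(sample(), stopTags));
        // With an empty tag set nothing is dropped.
        System.out.println(filter(sample(), new HashSet<String>()));
    }
}
```

If a tag set like this one also contained the tag of a word you care
about, that word would vanish from the stream in the same way, which is
why removing or adjusting the part-of-speech stop filter is worth
trying.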

Dawid

On Fri, Jan 18, 2013 at 2:46 PM, Jerome Lanneluc
<jerome_lanneluc@fr.ibm.com> wrote:
> Thanks for your answer.
>
> No, those words are not part of the stop word file (I'm using the one that
> comes with the Japanese analyzer in lucene-kuromoji-3.6.1.jar).
>
> My Japanese contact told me that the first sentence means "I am Japanese"
> and the second one is a unit of length.
>
> Jerome
>
>
>
> From:   Swapnil Patil <ping.swapnil@gmail.com>
> To:     java-user@lucene.apache.org,
> Date:   01/18/2013 02:33 PM
> Subject:        Re: Japanese analyzer
>
>
>
> Hi,
>
> I just translated these words using Google Translate; they look like: Japanese
> I [
> Can you check whether these words are in your stop word file?
> If these words exist in your stop word file, then you will not get them in
> the token stream.
>
> -Swapnil
>
> On Fri, Jan 18, 2013 at 6:58 PM, Jerome Lanneluc
> <jerome_lanneluc@fr.ibm.com> wrote:
>
>> [私 日本人
