lucene-dev mailing list archives

From "Christian Moen (JIRA)" <>
Subject [jira] [Commented] (SOLR-3524) Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
Date Fri, 08 Jun 2012 08:52:23 GMT


Christian Moen commented on SOLR-3524:

Hiraga-san, there are different views on how punctuation characters are best handled by tokenizers.
Punctuation characters generally don't convey much meaning that is useful for text search, so they
are usually removed in Lucene. (A different point of view is that tokenizers shouldn't remove
punctuation and that filters should do this instead.)

The ability to keep punctuation was left as an expert feature in JapaneseTokenizer, and I think
we can expose this as an expert feature in Solr as well.  Could you share some details on
your use case so that I get a better idea of the background and importance of this?


> Make discard-punctuation feature in Kuromoji configurable from JapaneseTokenizerFactory
> ---------------------------------------------------------------------------------------
>                 Key: SOLR-3524
>                 URL:
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.6
>            Reporter: Kazuaki Hiraga
>            Priority: Minor
>         Attachments: kuromoji_discard_punctuation.patch.txt
> JapaneseTokenizer (Kuromoji) doesn't provide a configuration option to preserve punctuation
> in Japanese text, although it has a parameter to change this behavior.  JapaneseTokenizerFactory
> always sets the third parameter, which controls this behavior, to true to remove punctuation.
> I would like to have an option to configure this behavior in the fieldtype definition
> in schema.xml.
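
To illustrate the request, here is a minimal sketch of how a factory might read such an option from its schema.xml attributes. It is not the attached patch: the attribute name "discardPunctuation" and the helper class are assumptions made for illustration, and the sketch avoids any Lucene or Solr dependency. The default of true mirrors the current behavior described in the issue.

```java
import java.util.HashMap;
import java.util.Map;

public class DiscardPunctuationOption {
    // Hypothetical attribute name; the issue text does not specify one.
    public static final String DISCARD_PUNCTUATION = "discardPunctuation";

    /**
     * Reads the flag from the factory's attribute map.
     * Defaults to true, matching the behavior the issue describes as hard-coded.
     */
    public static boolean parseDiscardPunctuation(Map<String, String> args) {
        String value = args.get(DISCARD_PUNCTUATION);
        if (value == null) {
            return true;
        }
        return Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] unused) {
        Map<String, String> factoryArgs = new HashMap<>();
        // No attribute set: punctuation is discarded, as today.
        System.out.println(parseDiscardPunctuation(factoryArgs));

        // Attribute set to "false": punctuation would be preserved.
        factoryArgs.put(DISCARD_PUNCTUATION, "false");
        System.out.println(parseDiscardPunctuation(factoryArgs));
    }
}
```

In a real factory the parsed value would presumably be passed as the third constructor argument of JapaneseTokenizer in place of the hard-coded true.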


