lucene-solr-user mailing list archives

From Tom Zimmermann <>
Subject Re: Preferred Schema/Config for Chinese Language Cores?
Date Mon, 08 Dec 2014 16:29:35 GMT
I tracked down an example from a sample solr config of a CJK setup with
bigrams and no CJK tokenizer:

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- normalize width before bigram, as e.g. half-width dakuten combine -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- for any non-CJK -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>

This seems like it could be a good approach, but I also saw mention of an ICU
tokenizer that might be well suited to Chinese text, though it may be intended
for multilingual fields.
Does anyone have any familiarity with the ICU tokenizer vs. the Standard
tokenizer for a field that will store only Chinese text?
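
For comparison, an ICU-based field type might look roughly like the sketch
below. This is an untested assumption rather than a recommended setup: the
name "text_zh_icu" is made up for illustration, and solr.ICUTokenizerFactory
is not in Solr core, so the ICU jars from the analysis-extras contrib would
need to be on the classpath. It drops the bigram filter on the assumption
that the ICU tokenizer segments Chinese text into words itself.

```xml
<!-- Hypothetical field type name; requires the analysis-extras (ICU) jars -->
<fieldType name="text_zh_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- ICU tokenizer in place of StandardTokenizer + CJKBigramFilter -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- normalize full-width/half-width forms -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- for any non-CJK -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Running the same sample text through both field types on the Analysis screen
of the Solr admin UI would show how the resulting tokens differ.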


On Fri, Dec 5, 2014 at 5:41 PM, Tom Zimmermann <> wrote:

> Thanks for the links. The dzone link was nice and concise, but
> unfortunately makes use of the now-deprecated CJK tokenizer. Does anyone
> out there have some examples or experience working with the recommended
> replacement for CJK?
> Thanks,
> TZ
