lucene-solr-user mailing list archives

From Tom Zimmermann <zimm.to...@gmail.com>
Subject Re: Preferred Schema/Config for Chinese Language Cores?
Date Mon, 08 Dec 2014 16:29:35 GMT
I tracked down an example from a sample Solr config of a CJK setup with
bigrams and no CJK tokenizer:

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- normalize width before bigram, as e.g. half-width dakuten combine -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- for any non-CJK -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>


This seems like it could be a good approach, but I also saw mention of an
ICU tokenizer that might be well suited to Chinese text, although it may be
intended for multilingual fields:
https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-ICUTokenizer
Does anyone have any familiarity with the ICU vs. Standard tokenizer for a
field that will store only Chinese text?
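
For comparison, here is a rough sketch of what an ICU variant of that field
type might look like. It is untested and rests on assumptions: the name
text_cjk_icu is made up for illustration, it assumes the ICU analysis jars
from contrib/analysis-extras are on the classpath, and it simply swaps the
tokenizer while leaving the filters from the sample config unchanged.

```xml
<fieldType name="text_cjk_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- assumption: ICU tokenizer swapped in for StandardTokenizerFactory -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- filters carried over unchanged from the sample text_cjk config -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>
```

Whether the bigram filter still makes sense after ICU tokenization is part
of what I am unsure about.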


-Tom

On Fri, Dec 5, 2014 at 5:41 PM, Tom Zimmermann <zimm.tom.j@gmail.com> wrote:

> Thanks for the links. The DZone link was nice and concise, but
> unfortunately makes use of the now-deprecated CJK tokenizer. Does anyone
> out there have some examples or experience working with the recommended
> replacement for CJK?
>
> Thanks,
> TZ
>
