lucene-solr-user mailing list archives

From Walter Underwood <wun...@wunderwood.org>
Subject Re: Best practices for Solr highlighter for CJK
Date Wed, 02 Jan 2013 19:00:34 GMT
Speaking from experience: if you are using bigrams for CJK, do not highlight. The results will
look very wrong to someone who knows the language.

Even with a dictionary-based tokenizer, you'll need a client dictionary for local terms.

wunder

On Jan 2, 2013, at 10:51 AM, Tom Burton-West wrote:

> Hello all,
> 
> What are the best practices for setting up the highlighter to work with CJK?
> We are using the ICUTokenizer with the CJKBigramFilter, so overlapping
> bigrams are what are actually being searched. However the highlighter seems
> to only highlight the first of any two overlapping bigrams, i.e. ABC =>
> searched as AB BC, but only AB gets highlighted even if the matching string is
> ABC. (Here ABC stands for Chinese characters such as 大亚湾 => searched as 大亚 亚湾,
> but only 大亚 is highlighted rather than 大亚湾.)
> 
> Is there some highlighting parameter that might fix this?
> 
> Tom Burton-West
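
To illustrate the behavior described in the quoted question, here is a minimal Python sketch of overlapping bigram matching, in which the offsets of matching bigrams are merged before markup is applied. It is not Solr's highlighter and does not use any Solr or Lucene API; the function names (cjk_bigrams, merge_spans, highlight) and the <em> tags are hypothetical, chosen only for this example. It assumes plain character bigrams with no script detection, which is simpler than what ICUTokenizer plus CJKBigramFilter actually do.

```python
def cjk_bigrams(text):
    """Return overlapping character bigrams as (term, start, end) tuples."""
    return [(text[i:i + 2], i, i + 2) for i in range(len(text) - 1)]


def merge_spans(spans):
    """Merge overlapping or touching (start, end) spans into maximal runs."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Extend the previous run instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def highlight(document, query, pre="<em>", post="</em>"):
    """Wrap every run of query bigrams found in the document in pre/post."""
    query_terms = {term for term, _, _ in cjk_bigrams(query)}
    hits = [(s, e) for term, s, e in cjk_bigrams(document) if term in query_terms]
    out, pos = [], 0
    for start, end in merge_spans(hits):
        out.append(document[pos:start])
        out.append(pre + document[start:end] + post)
        pos = end
    out.append(document[pos:])
    return "".join(out)


if __name__ == "__main__":
    print(cjk_bigrams("大亚湾"))
    print(highlight("大亚湾核电站", "大亚湾"))
```

With the query 大亚湾, the two matching bigrams 大亚 (offsets 0-2) and 亚湾 (offsets 1-3) are merged into one span covering 大亚湾. Merging only fixes the "first bigram only" symptom; it does not address the concern raised in the reply that bigram-based highlights can still look wrong to a reader of the language, since a bigram match need not align with real word boundaries.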
