lucene-solr-user mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: How are people using the ICUTokenizer?
Date Tue, 20 Jun 2017 17:54:52 GMT
> So, if you are trying to make sure your index breaks words properly on eastern languages,
> just use ICU Tokenizer.

I defer to the expertise on this list, but last I checked, ICUTokenizer uses dictionary lookup
to tokenize CJK. This may work well for some tasks, but I haven't evaluated whether it performs
better than smartcn or even just CJKBigramFilter on actual retrieval tasks, and I'd be hesitant
to state "just use" and imply the problem is solved.

I thought I remembered ICUTokenizer not playing well with the CJKBigramFilter, but it appears
to be working in 6.6.
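
For anyone unfamiliar with the bigram alternative, here is a rough sketch of the idea in plain Python. It only mimics the overlapping two-character tokens that a CJK bigram approach produces; `cjk_bigrams` is a hypothetical helper, not Solr or Lucene code, and it assumes the input is already a single run of CJK characters.

```python
def cjk_bigrams(text):
    """Return overlapping two-character tokens for one run of CJK characters.

    Hypothetical illustration only; this is not the Lucene implementation.
    """
    if len(text) < 2:
        # A lone character (or empty string) cannot form a bigram.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Bigrams need no dictionary, but they also emit tokens that straddle
# word boundaries, which is one reason to evaluate on real retrieval tasks.
print(cjk_bigrams("東京都庁"))  # ['東京', '京都', '都庁']
```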

> use the ICUNormalizer
I could not agree with this more.  

-----Original Message-----
From: Davis, Daniel (NIH/NLM) [C] [mailto:daniel.davis@nih.gov] 
Sent: Tuesday, June 20, 2017 12:02 PM
To: solr-user@lucene.apache.org
Subject: RE: How are people using the ICUTokenizer?

Joel,

I think the issue is doing word-breaking according to ICU rules. So, if you are trying to
make sure your index breaks words properly on eastern languages, just use ICU Tokenizer.
Unless your text is already in an ICU normal form, you should always use the ICUNormalizer
character filter along with this:

https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.ICUNormalizer2CharFilterFactory
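
To show why normalizing before tokenizing matters, here is a minimal Python sketch using only the standard library. It approximates an NFKC-plus-case-folding normal form with `unicodedata`; it is an assumption-laden stand-in for illustration, not the ICU character filter itself, and `normalize_text` is a hypothetical name.

```python
import unicodedata


def normalize_text(text):
    """Approximate an NFKC + case-folded form (illustrative, not ICU itself)."""
    # NFKC folds compatibility variants (e.g. full-width letters) and composes
    # base characters with combining marks; casefold() removes case differences.
    return unicodedata.normalize("NFKC", text).casefold()


fullwidth = "Ｓｏｌｒ"        # full-width Latin letters
decomposed = "cafe\u0301"    # 'e' followed by a combining acute accent
composed = "caf\u00e9"       # precomposed 'é'

# Without normalization these would be indexed as different terms.
print(normalize_text(fullwidth) == "solr")                      # True
print(normalize_text(decomposed) == normalize_text(composed))   # True
```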

I think this would work well with shingles when you are not removing stop words, perhaps
in an alternate analysis chain over the same content.

I'm using it in this way, with shingles for phrase recognition and only doc freq and term
freq - my possibly naïve idea is that I do not need positions and offsets if I'm using shingles,
and my main goal is to do a MoreLikeThis query using the shingled versions of fields.
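
The reasoning, roughly sketched in Python: a shingle bakes word adjacency into the term itself, so plain term counts already carry some local word order. This is a toy illustration with a hypothetical `shingles` helper and whitespace tokenization, not what Solr does internally.

```python
from collections import Counter


def shingles(tokens, size=2):
    """Join each run of `size` adjacent tokens into one term (hypothetical helper)."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


tokens = "the quick brown fox jumps over the quick brown dog".split()

# Term frequencies over shingles: no positions or offsets are stored,
# yet "quick brown" is still distinguishable from "brown quick".
tf = Counter(shingles(tokens))
print(tf["quick brown"])   # 2
print(tf["brown quick"])   # 0
```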

-----Original Message-----
From: Joel Bernstein [mailto:joelsolr@gmail.com] 
Sent: Tuesday, June 20, 2017 11:52 AM
To: solr-user@lucene.apache.org
Subject: How are people using the ICUTokenizer?

It seems that there are some powerful capabilities in the ICUTokenizer. I was wondering how
the community is making use of it.

Does anyone have experience working with the ICUTokenizer that they can share?


Joel Bernstein
http://joelsolr.blogspot.com/