lucene-solr-user mailing list archives

From Andy <>
Subject Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?
Date Mon, 04 Oct 2010 09:44:19 GMT
> > 1) hyphens - if user types "ema" or "e-ma" I want to
> > suggest "email"
> > 
> > 2) accents - if user types "herme" I want to suggest
> > "Hermès"
> Accents can be removed by using MappingCharFilterFactory
> before the tokenizer (at both index and query time):
> <charFilter class="solr.MappingCharFilterFactory"
> mapping="mapping-ISOLatin1Accent.txt"/>
> I am not sure this is the most elegant solution, but you can
> also replace "-" with "" using MappingCharFilterFactory. That
> satisfies what you describe in 1.
> But in general NGramFilterFactory produces a lot of tokens.
> For example, the query "er" can return "hermes". Maybe
> EdgeNGramFilterFactory is more suitable for the
> auto-complete task. At least it guarantees that some word
> starts with that character sequence.
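
To make the quoted point concrete, here is a minimal Python sketch (not Solr code) of the difference between plain n-grams and edge n-grams, plus the accent and hyphen mapping. The function names, the gram sizes, and the lowercasing step are assumptions for illustration only; the real behaviour depends on how the analyzer chain is configured.

```python
import unicodedata


def normalize(text):
    """Stand-in for a char-filter mapping: strip accents and drop hyphens.

    Lowercasing is an extra assumption here; the mapping described above
    does not do it by itself.
    """
    decomposed = unicodedata.normalize("NFD", text.lower())
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return no_accents.replace("-", "")


def ngrams(token, min_n, max_n):
    """Every substring of length min_n..max_n, from any position in the token."""
    return {
        token[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(token) - n + 1)
    }


def edge_ngrams(token, min_n, max_n):
    """Only the prefixes of length min_n..max_n (front edge of the token)."""
    return {token[:n] for n in range(min_n, min(max_n, len(token)) + 1)}


if __name__ == "__main__":
    tag = normalize("Hermès")
    # Plain n-grams index the middle of the word, so "er" matches.
    print("er" in ngrams(tag, 2, 5))
    # Edge n-grams only index prefixes, so "er" does not match but "herme" does.
    print("er" in edge_ngrams(tag, 2, 5), "herme" in edge_ngrams(tag, 2, 5))
    # Dropping hyphens lets "e-ma" match the tag "email".
    print(normalize("e-ma") in edge_ngrams(normalize("email"), 2, 5))
```

The trade-off is the one described above: plain n-grams can match the middle of a tag but produce many more tokens, while edge n-grams only ever match from the start of a token.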


I agree with the issues with NGramFilterFactory you pointed out and I really want to avoid
using it. But the problem is that I have Chinese tags like "电吉他" and multi-lingual tags
like "electric吉他".

For tags like that, WhitespaceTokenizerFactory wouldn't work. And if I use ChineseFilterFactory,
would it recognize that the "electric" in "electric吉他" isn't Chinese and shouldn't be
split into individual characters?

Any ideas here are greatly appreciated.

In a related matter, I checked out
and saw that there are:

EdgeNGramFilterFactory & EdgeNGramTokenizerFactory
NGramFilterFactory & NGramTokenizerFactory

What are the differences between *FilterFactory and *TokenizerFactory? In my case which one
should I be using?


