lucene-java-user mailing list archives

From: "Jack Krupansky"
Subject: Re: How to handle words that stem to stop words
Date: Mon, 07 Jul 2014 21:31:11 GMT
Some of these anomalous cases are best handled by simply suppressing 
stemming. PatternKeywordMarkerFilter and SetKeywordMarkerFilter set the 
keyword attribute on matching tokens, and most stemmers leave tokens marked 
as keywords unchanged. The marker filter has to come before the stemmer in 
the analysis chain.

You can create a list of words to protect from stemming, like plurals of 
your stop words, or possibly a pattern that matches a stop word plus a short 
suffix that the stemmer would otherwise strip.
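
As a rough illustration of the pattern idea, here is a minimal sketch that 
builds a java.util.regex.Pattern matching "stop word plus short suffix". The 
class name, the helper methods, and the sample word and suffix lists are all 
hypothetical and would need to be replaced with your own. It assumes tokens 
have already been lowercased.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Lucene: builds a regex for tokens that
// should be protected from stemming.
class StopWordSuffixPattern {
    // Sample data only (assumption); load your real stop word list instead.
    static final List<String> STOP_WORDS = List.of("van", "de", "het", "een");
    static final List<String> SUFFIXES = List.of("s", "en", "e");

    // Produces a pattern of the form (?:van|de|...)(?:s|en|e).
    // Each entry is quoted so it is treated as a literal, not as regex syntax.
    static Pattern build(List<String> stopWords, List<String> suffixes) {
        String words = stopWords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        String tails = suffixes.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("(?:" + words + ")(?:" + tails + ")");
    }

    // True only when the whole token is a stop word followed by one suffix.
    static boolean isProtected(Pattern pattern, String token) {
        return pattern.matcher(token).matches();
    }
}
```

A pattern built this way could then be passed to 
new PatternKeywordMarkerFilter(tokenStream, pattern), placed ahead of the 
SnowballFilter. With the sample lists, 'vans' is marked as a keyword and 
kept as-is, while 'van' itself is not matched. A broad suffix list may 
protect more words than intended, so an explicit word list with 
SetKeywordMarkerFilter may be the safer choice for a small number of known 
cases.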

-- Jack Krupansky

-----Original Message----- 
From: Arjen van der Meijden
Sent: Sunday, July 6, 2014 2:47 PM
Subject: How to handle words that stem to stop words

Hello list,

We have a fairly large Lucene database for a 30+ million post forum.
Users post and search for all kinds of things. To make sure users don't
have to type exact matches, we combine a WordDelimiterFilter with a
(Dutch) SnowballFilter.

Unfortunately, users sometimes find words that get stemmed to a word that
is basically a stop word. Or, conversely, a very common word is stemmed so
that it becomes the same as a rare word.

We do index stop words, so theoretically they could still find their
result. But when a rare word is stemmed in such a way that it yields a
million hits, the results become unusable.

One example is the Dutch word 'van' which is the equivalent of 'of' in
English. A user tried to search for the shoe brand 'vans', which gets
stemmed to 'van' and obviously gives useless results.

I have already found the KeywordRepeatFilter, to index/search both 'vans'
and 'van', and the StemmerOverrideFilter, to try and prevent these cases.
Are there any other solutions for these kinds of problems?

Best regards,

Arjen van der Meijden

To unsubscribe, e-mail:
For additional commands, e-mail: 
