lucene-java-user mailing list archives

From Arjen van der Meijden <acmmail...@tweakers.net>
Subject How to handle words that stem to stop words
Date Sun, 06 Jul 2014 18:47:37 GMT
Hello list,

We have a fairly large Lucene index for a forum with 30+ million posts. 
Users post and search for all kinds of things. To make sure users don't 
have to type exact matches, we combine a WordDelimiterFilter with a 
(Dutch) SnowballFilter.

Unfortunately, users sometimes find words that get stemmed to what is 
basically a stop word. Or, conversely, a very common word is stemmed so 
that it becomes the same as a rare word.

We do index stop words, so theoretically they could still find their 
result. But when a rare word is stemmed in such a way that it yields a 
million hits, the results become unusable...

One example is the Dutch word 'van' which is the equivalent of 'of' in 
English. A user tried to search for the shoe brand 'vans', which gets 
stemmed to 'van' and obviously gives useless results.

I have already found the KeywordRepeatFilter, which would let us 
index/search both 'vans' and 'van', and the StemmerOverrideFilter, which 
could be used to prevent these cases. Are there any other solutions for 
these kinds of problems?
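
To make the two ideas concrete, here is a minimal sketch in plain Java. It deliberately does not use Lucene's API: `StemGuardSketch`, `toyStem`, `stemWithOverrides` and `originalAndStem` are hypothetical names, and `toyStem` is a crude stand-in for a real Dutch Snowball stemmer, chosen only so that 'vans' collapses to 'van' as in the example above.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class StemGuardSketch {

    // Hypothetical stand-in for a real stemmer: strips a single trailing 's'
    // from tokens longer than three characters. Not the Snowball algorithm.
    static String toyStem(String token) {
        if (token.length() > 3 && token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Override idea: tokens listed in the dictionary map to a fixed form
    // and never reach the stemmer; everything else is stemmed as usual.
    static String stemWithOverrides(String token, Map<String, String> overrides) {
        String key = token.toLowerCase(Locale.ROOT);
        String forced = overrides.get(key);
        return forced != null ? forced : toyStem(key);
    }

    // Keyword-repeat idea: emit the unstemmed token first, then its stem,
    // dropping the stem when it is identical to the original.
    static List<String> originalAndStem(String token) {
        String key = token.toLowerCase(Locale.ROOT);
        Set<String> forms = new LinkedHashSet<>();
        forms.add(key);
        forms.add(toyStem(key));
        return new ArrayList<>(forms);
    }
}
```

With an override entry mapping 'vans' to itself, `stemWithOverrides("vans", ...)` keeps the brand name intact, while `originalAndStem("vans")` yields both forms so that an exact match can be ranked above the stemmed one. In Lucene itself, KeywordRepeatFilter is usually followed by the stemmer and then RemoveDuplicatesTokenFilter to get the deduplication that the `LinkedHashSet` provides here.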

Best regards,

Arjen van der Meijden
