lucene-java-user mailing list archives

From Arjen van der Meijden <acmmail...@tweakers.net>
Subject Re: How to handle words that stem to stop words
Date Thu, 10 Jul 2014 18:57:08 GMT
I'm reluctant to apply either solution:

Emitting both tokens will likely still give the user a very long
result list. Even though the results containing 'vans' are likely to be
ranked at the top, it's still not very user-friendly because of the
overwhelmingly large number of results (nor is it very good for the
performance of my application).
In our specific case we also boost documents based on their age and 
popularity, so the extra results will probably interfere even if 
'vans'-results are generally ranked higher.


The approach with a list of specially treated terms is something we'll 
have to build and maintain by hand. Every time such a list is adjusted, 
it'll require a reindex of the database, which is not a huge problem but 
still not very practical.

But I'm getting more and more convinced there isn't really a (reasonably
easy) solution that would let such a list change dynamically without
requiring a reindex of the database.
Luckily the list of stop words shouldn't change that fast and we already
have more than ten years' worth of data, so it should be fairly easy to
build a list of terms that are stemmed into stop words.
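
As a rough sketch of how such a list could be built: run every known term
through the stemmer and keep the ones whose stem is a stop word while the
term itself is not. Everything named below is hypothetical
(StemCollisionFinder, findCollisions), and the toy stemmer that only strips
a trailing 's' merely stands in for the Dutch Snowball stemmer. In practice
the terms would presumably come from the existing data rather than a
hard-coded list.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of Lucene.
final class StemCollisionFinder {

    /**
     * Returns every term that is not a stop word itself but whose stem is
     * one, mapped to that stem. The result is sorted by term.
     */
    static Map<String, String> findCollisions(Collection<String> terms,
                                              Set<String> stopWords,
                                              UnaryOperator<String> stemmer) {
        Map<String, String> collisions = new TreeMap<>();
        for (String term : terms) {
            String stem = stemmer.apply(term);
            if (!stopWords.contains(term) && stopWords.contains(stem)) {
                collisions.put(term, stem);
            }
        }
        return collisions;
    }

    public static void main(String[] args) {
        // Assumption: a toy stemmer standing in for the real Dutch stemmer.
        UnaryOperator<String> toyStemmer =
            t -> (t.length() > 1 && t.endsWith("s")) ? t.substring(0, t.length() - 1) : t;

        Set<String> stopWords = Set.of("van", "de", "het");
        List<String> terms = List.of("vans", "van", "schoenen", "des");

        // 'vans' and 'des' collide with stop words; 'van' is a stop word
        // itself and 'schoenen' is left alone by the toy stemmer.
        System.out.println(findCollisions(terms, stopWords, toyStemmer));
    }
}
```

The resulting map could then serve as the input for whatever special
treatment is chosen for those terms, for example an override that keeps
'vans' unstemmed.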

Best regards,

Arjen

On 7-7-2014 23:06 Tri Cao wrote:
> I think emitting two tokens for "vans" is the right (potentially only)
> way to do it. You could
> also control the dictionary of terms that require this special treatment.
>
> Any reason makes you not happy with this approach?
>
> On Jul 06, 2014, at 11:48 AM, Arjen van der Meijden
> <acmmailing@tweakers.net> wrote:
>
>> Hello list,
>>
>> We have a fairly large Lucene database for a 30+ million post forum.
>> Users post and search for all kinds of things. To make sure users don't
>> have to type exact matches, we combine a WordDelimiterFilter with a
>> (Dutch) SnowballFilter.
>>
>> Unfortunately users sometimes find examples of words that get stemmed to
>> a word that's basically a stop word. Or reversely, where a very common
>> word is stemmed so that it becomes the same as a rare word.
>>
>> We do index stop words, so theoretically they could still find their
>> result. But when a rare word is stemmed in such a way it yields a
>> million hits, that makes it very unusable...
>>
>> One example is the Dutch word 'van' which is the equivalent of 'of' in
>> English. A user tried to search for the shoe brand 'vans', which gets
>> stemmed to 'van' and obviously gives useless results.
>>
>> I already noticed the 'KeywordRepeatFilter' to index/search both 'vans'
>> and 'van' and the StemmerOverrideFilter to try and prevent these cases.
>> Are there any other solutions for these kinds of problems?
>>
>> Best regards,
>>
>> Arjen van der Meijden
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> <mailto:java-user-unsubscribe@lucene.apache.org>
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> <mailto:java-user-help@lucene.apache.org>
>>
