lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-2055) Fix buggy stemmers and Remove duplicate analysis functionality
Date Mon, 01 Feb 2010 09:57:51 GMT


Robert Muir commented on LUCENE-2055:

Here is a short explanation of the part I figure might be controversial: adding all the
language-specific analyzers.

I think it's too difficult for a non-English user to use Lucene.
Let's take the Romanian case: sure, it's supported by SnowballAnalyzer, but:
* Where are the stopwords? If the user is smart enough, they can google this and find Savoy's
list... but it contains some stray nouns that should not be in there, and will they get the
encoding correct?
* For some languages (French, Dutch, Turkish) we already want to do something different.
For French we need the elision filter to tokenize correctly; for Dutch, the special dictionary-based
stem exclusions (I have been told by some that any stemmer that does not handle "fiets" correctly is useless);
for Turkish we need the special lowercasing.
* For other languages (German, Swedish, ...) I think we REALLY want to implement decompounding
support in the future. For German at least, there is a public-domain wordlist just itching
to be used for this.
* Oh yeah, and all the javadocs are in English, so writing your own analyzer is another barrier
to entry.
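To make the Turkish point concrete, here is a minimal plain-JDK sketch (not Lucene code; the class name is made up for illustration) of why a locale-unaware lowercasing step corrupts Turkish text: Turkish has both a dotted and a dotless "i", so uppercase 'I' must lowercase to dotless 'ı' (U+0131), not to 'i'.

```java
import java.util.Locale;

// Hypothetical demo class, for illustration only.
public class TurkishLowercaseDemo {
    public static void main(String[] args) {
        String upper = "MILLI";
        // Locale-unaware lowercasing: 'I' -> 'i', wrong for Turkish.
        System.out.println(upper.toLowerCase(Locale.ROOT));             // milli
        // Turkish-aware lowercasing: 'I' -> dotless 'ı' (U+0131).
        System.out.println(upper.toLowerCase(new Locale("tr", "TR"))); // mıllı
    }
}
```

An analyzer that lowercases with the default (English) rules will therefore conflate or miss Turkish terms, which is why a dedicated Turkish lowercasing filter is needed.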

So I think it's best instead to have a "recommended default" organized by language, preferably
one we have relevance-tested or that is already published. Many of the existing Snowball stemmers
have published relevance results available, hence my bias towards them. Sure, it won't
meet everyone's needs, and users should still think about using these analyzers as templates. But
digging up your own stoplist, writing your own analyzer, and figuring out that your language's
support is buried in Snowball, combined with documentation not in your native tongue, adds up
to a barrier to entry that is simply too high.

> Fix buggy stemmers and Remove duplicate analysis functionality
> --------------------------------------------------------------
>                 Key: LUCENE-2055
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>             Fix For: 3.1
>         Attachments: LUCENE-2055.patch
> would like to remove stemmers in the following packages, and instead use a SnowballStemFilter
> in their analyzers.
> * analyzers/fr
> * analyzers/nl
> * analyzers/ru
> Below are excerpts from this code where they proudly proclaim they use the Snowball algorithm.
> I think we should delete all of this custom stemming code in favor of the actual snowball
> {noformat}
> /**
>  * A stemmer for French words. 
>  * <p>
>  * The algorithm is based on the work of
>  * Dr Martin Porter on his snowball project<br>
>  * refer to<br>
>  * (French stemming algorithm) for details
>  * </p>
>  */
> public class FrenchStemmer {
> /**
>  * A stemmer for Dutch words. 
>  * <p>
>  * The algorithm is an implementation of
>  * the <a href="">dutch
>  * algorithm in Martin Porter's snowball project.
>  * </p>
>  */
> public class DutchStemmer {
> /**
>  * Russian stemming algorithm implementation (see for detailed description).
>  */
> class RussianStemmer
> {noformat}

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
