lucene-dev mailing list archives

From "Pete Lewis" <>
Subject Re: SnowballAnalyzer
Date Tue, 07 Oct 2003 07:54:57 GMT
Hi all

I know that I have no vote but I think that it would be wrong to bring the SnowballAnalyzer
into the core.

There are some distinct limitations to this purely algorithmic approach.  Yes, it would be
great to say 'hey, we have 14 languages covered', but you should first realise what the
product cannot do.  Let's start with some definitions.

'Stemming' signifies the process of finding the stems in words. 'Lemmatisation' is the process
of reducing the word form to its 'lemma' form, i.e. the form one expects to find in a dictionary.
The differences are:

1.      In many languages the dictionary form is not the stem. E.g. in Dutch the infinitive
of a verb is not its stem.

2.      Words may have several stems due to composition (common in Germanic languages).

The terms are both used extremely loosely in the literature, where they often indicate the
same thing.

A tool often used for English is the Porter stemmer. Strictly speaking, it is neither a stemmer
nor a lemmatiser; it cuts off certain characters on the basis of the characters before them. In
many cases morphologically equivalent forms reduce to the same root form. There have been
efforts to create similar algorithmic tools for other languages. Porter has recently designed
a language called Snowball for writing scripts that perform these reductions. Snowball has
been applied to a number of languages, and in many cases the scripts are publicly available.
Snowball is not capable of handling composition, nor is it capable of handling other
more demanding morphological patterns, such as agglutination and infixes.
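
As a rough illustration of what "cutting off certain characters on the basis of the characters before them" means, here is a minimal Python sketch. The rule table and the name `toy_stem` are made up for this example; they are not the Porter or Snowball rules, just a few hand-picked rules in the same spirit, conditioned only on how much of the word precedes the suffix.

```python
# Hypothetical rule table: (suffix, replacement, minimum length of what precedes the suffix).
# Illustrative only; NOT the actual Porter or Snowball rules.
RULES = [
    ("ement", "", 3),
    ("ing", "", 3),
    ("es", "e", 2),
    ("s", "", 3),
    ("e", "", 3),
]


def toy_stem(word):
    """Apply the first matching suffix rule; no dictionary is consulted."""
    word = word.lower()
    for suffix, replacement, min_prefix in RULES:
        prefix = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(prefix) >= min_prefix:
            return prefix + replacement
    return word


if __name__ == "__main__":
    for w in ["bus", "buses", "catch", "caught", "manage", "management"]:
        print(w, "->", toy_stem(w))
```

With this toy table, "buses" becomes "buse" and "management" becomes "manag", while "caught" is left untouched because no suffix rule can know it is a form of "catch".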

Basically, people expect the terms in the search query to be reduced to the same root
form as that used for indexing, and hence to be able to find the different derivations
of the term (plurals etc.).
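
To make that expectation concrete, here is a small Python sketch of an index in which the same normalisation function is applied at index time and at query time. The names `build_index`, `search` and `strip_final_s` are invented for this example and have nothing to do with the Lucene API; the normaliser is deliberately crude.

```python
from collections import defaultdict


def strip_final_s(word):
    """Hypothetical normaliser: lower-case and drop one trailing 's' from longer words."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def build_index(docs, normalise):
    """Map each normalised term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalise(token)].add(doc_id)
    return index


def search(index, query, normalise):
    """Look up a single query term after applying the same normalisation."""
    return index.get(normalise(query), set())


docs = {1: "the cat sat", 2: "two cats ran", 3: "one bus left", 4: "many buses came"}
index = build_index(docs, strip_final_s)

print(search(index, "cat", strip_final_s))  # finds documents 1 and 2
print(search(index, "bus", strip_final_s))  # finds document 3 only
```

A query for "cat" finds both "cat" and "cats", which is what the user expects. A query for "bus" misses document 4, because "buses" was indexed as "buse": the reduction is consistent on both sides but still fails to bring the two forms together.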

Some examples from Snowball should speak for themselves:

bus -> bus

buses -> buse

catch -> catch

caught -> caught

manage -> manag

management -> manag

These show incorrect handling of plurals (bus/buses), irregular forms (catch/caught), and the
conflation of verbs and nouns (manage/management).  Obviously many other examples can be found.
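
For contrast, a dictionary-based lemmatiser resolves these cases by lookup rather than by suffix rules. The sketch below is a minimal Python illustration; the `LEMMAS` table is a tiny hand-written stand-in for a real lexicon, and it is not meant to show how KStem or any other particular tool is implemented.

```python
# Hypothetical, hand-written lexicon mapping word forms to dictionary forms.
LEMMAS = {
    "buses": "bus",
    "caught": "catch",
    "catches": "catch",
    "manages": "manage",
    "managed": "manage",
}


def lemmatise(word):
    """Return the dictionary form if the word is in the lexicon, else the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)


for w in ["bus", "buses", "catch", "caught", "manage", "management"]:
    print(w, "->", lemmatise(w))
```

Here "buses" and "caught" map to "bus" and "catch", while "manage" and "management" stay distinct. The cost is that coverage depends entirely on the lexicon.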

While this isn't too bad for English, it gets pretty dire for other languages.

For English I'd prefer KStem rather than Snowball.



----- Original Message ----- 
From: "Erik Hatcher" <>
To: "Lucene List" <>
Sent: Monday, October 06, 2003 6:49 PM
Subject: SnowballAnalyzer

> At one point, I believe, it was proposed to bring the sandbox 
> SnowballAnalyzer into the core.  Is this still desired or shall we just 
> leave it in the sandbox?
> Erik