lucene-java-user mailing list archives

From Dawid Weiss <dawid.we...@gmail.com>
Subject Re: Limitations of StempelStemmer
Date Wed, 11 Sep 2019 06:27:55 GMT
Hi Maciej,

Stempel uses a pretrained heuristic. You can find a longer description
at [1] and [2]. The specific reason for the problems you mentioned may
be the smaller training dictionary used for the version embedded in
Lucene; I honestly don't know. If you need exact
stemming/lemmatization, then take a look at dictionary methods --
Morfologik or the tools listed at [3].

Dawid

[1] http://www.getopt.org/stempel/
[2] https://lucene.apache.org/core/8_2_0/analyzers-stempel/index.html
[3] http://zil.ipipan.waw.pl/
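
On the second question in the quoted message below (getting the original
form back for words the stemmer was not trained on): one possible
workaround is to wrap the stemmer in a guard that rejects implausible
results. The sketch below is only an illustration under stated assumptions,
not part of Stempel or Lucene. The class StemGuard, its methods, and the
"shared prefix" test are all hypothetical, and the stand-in stemmer is
hard-coded to mimic the outputs reported in this thread. The heuristic
would also reject any legitimate stem that changes the beginning of a word.

```java
import java.util.function.Function;

// Hypothetical helper, not part of Lucene or Stempel.
public class StemGuard {

  /** Returns the length of the longest common prefix of a and b. */
  public static int commonPrefixLength(CharSequence a, CharSequence b) {
    int n = Math.min(a.length(), b.length());
    int i = 0;
    while (i < n && a.charAt(i) == b.charAt(i)) {
      i++;
    }
    return i;
  }

  /**
   * Applies the stemmer and keeps its answer only if it shares at least
   * minPrefix leading characters with the input. Otherwise (or if the
   * stemmer returns null) the word is returned unchanged.
   */
  public static String stemOrOriginal(
      String word, Function<String, CharSequence> stemmer, int minPrefix) {
    CharSequence stem = stemmer.apply(word);
    if (stem == null || commonPrefixLength(word, stem) < minPrefix) {
      return word;
    }
    return stem.toString();
  }

  public static void main(String[] args) {
    // Stand-in stemmer (assumption): returns a bare suffix for every word
    // except "xyz", like the listing in the quoted message.
    // "\u0105\u0107" is the two-letter string a-ogonek, c-acute.
    Function<String, CharSequence> fakeStemmer =
        w -> w.equals("xyz") ? "xyz" : "\u0105\u0107";
    for (String w : new String[] {"joyce", "piwko", "xyz"}) {
      System.out.println(w + " -> " + stemOrOriginal(w, fakeStemmer, 2));
    }
  }
}
```

With a real StempelStemmer instance, the function passed in would presumably
be something like w -> stemmer.stem(w). The threshold of 2 characters is an
arbitrary choice for the sketch and would need tuning on real data.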

On Tue, Sep 10, 2019 at 9:31 PM Maciej Gawinecki <mgawinecki@gmail.com> wrote:
>
> Hi,
>
> I have just checked out the latest version of Lucene from the Git master branch.
>
> I have tried to stem a few words using StempelStemmer for Polish.
> However, it looks like it cannot handle some words properly, e.g.
>
> joyce -> ąć
> wielce -> ąć
> piwko -> ąć
> royce -> ąć
> pip -> ąć
> xyz -> xyz
>
> 1. I am surprised it cannot handle Polish words like wielce, piwko and
> royce. Is this a limitation of the stemming algorithm, of the way it
> was trained, or something else? Knowing which would help improve the
> situation. How can I improve that behaviour?
> 2. I am surprised that for non-Polish words it returns "ać". I would
> expect that for words it has not been trained on it would return their
> original forms, as happens, for instance, when stemming words like
> "xyz".
>
> With kind regards,
> Maciej Gawinecki
>
> Here's a minimal example to reproduce the issue:
>
> package org.apache.lucene.analysis;
>
> import java.io.InputStream;
> import org.apache.lucene.analysis.stempel.StempelStemmer;
>
> public class Try {
>
>   public static void main(String[] args) throws Exception {
>     // Load the Polish stemmer table from the classpath.
>     InputStream stemmerTable = ClassLoader.getSystemClassLoader()
>         .getResourceAsStream("org/apache/lucene/analysis/pl/stemmer_20000.tbl");
>     StempelStemmer stemmer = new StempelStemmer(stemmerTable);
>     String[] words = {"joyce", "wielce", "piwko", "royce", "pip", "xyz"};
>     for (String word : words) {
>       // Stem each word in turn and print "word -> stem".
>       System.out.println(String.format("%s -> %s", word, stemmer.stem(word)));
>     }
>
>   }
>
> }
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


