lucene-java-user mailing list archives

From Maciej Gawinecki <mgawine...@gmail.com>
Subject Limitations of StempelStemmer
Date Tue, 10 Sep 2019 19:30:38 GMT
Hi,

I have just checked out the latest version of Lucene from Git master branch.

I have tried to stem a few words using StempelStemmer for Polish.
However, it looks like it cannot handle some words properly, e.g.

joyce -> ąć
wielce -> ąć
piwko -> ąć
royce -> ąć
pip -> ąć
xyz -> xyz

1. I am surprised it cannot handle Polish words like wielce, piwko and
royce. Is this a limitation of the stemming algorithm, of the data the
algorithm was trained on, or something else? Knowing which would help
me improve the situation. How can I improve that behaviour?
2. I am surprised that for non-Polish words it returns "ąć". I would
expect that for words it has not been trained on it would return their
original forms, as happens, for instance, when stemming words like
"xyz".

With kind regards,
Maciej Gawinecki

Here's a minimal example to reproduce the issue:

package org.apache.lucene.analysis;

import java.io.InputStream;
import org.apache.lucene.analysis.stempel.StempelStemmer;

public class Try {

  public static void main(String[] args) throws Exception {
    // Load the Polish stemmer table bundled with the Stempel module.
    InputStream stemmerTable = ClassLoader.getSystemClassLoader()
        .getResourceAsStream("org/apache/lucene/analysis/pl/stemmer_20000.tbl");
    StempelStemmer stemmer = new StempelStemmer(stemmerTable);
    String[] words = {"joyce", "wielce", "piwko", "royce", "pip", "xyz"};
    for (String word : words) {
      // Stem each word in turn and print "word -> stem".
      System.out.println(String.format("%s -> %s", word, stemmer.stem(word)));
    }
  }

}
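
Regarding point 2, one workaround I could imagine is a guard that falls back to the original word whenever the returned stem has no leading characters in common with it. This is only a sketch, not part of Lucene: StemGuard, guardedStem and the minPrefix threshold are hypothetical names, the prefix check is my own heuristic, and the stand-in stemmer below merely replays the outputs listed in this message so that the example runs without Lucene on the classpath. The heuristic would also reject legitimate stems that differ from the word in their first letters.

```java
import java.util.function.Function;

public class StemGuard {

  // The suspicious two-character stem reported above (a-ogonek, c-acute),
  // written with Unicode escapes to stay independent of the source encoding.
  public static final String SUSPICIOUS = "\u0105\u0107";

  /** Length of the longest common prefix of a and b. */
  public static int commonPrefixLength(String a, String b) {
    int n = Math.min(a.length(), b.length());
    int i = 0;
    while (i < n && a.charAt(i) == b.charAt(i)) {
      i++;
    }
    return i;
  }

  /**
   * Returns the stem only if it shares at least minPrefix leading characters
   * with the word; otherwise (or if the stemmer returns null) returns the
   * word unchanged.
   */
  public static String guardedStem(String word, Function<String, String> stemmer, int minPrefix) {
    String stem = stemmer.apply(word);
    if (stem == null || commonPrefixLength(word, stem) < minPrefix) {
      return word;
    }
    return stem;
  }

  public static void main(String[] args) {
    // Hypothetical stand-in for the real stemmer: it reproduces only the
    // outputs listed in this message.
    Function<String, String> standIn = w -> w.equals("xyz") ? "xyz" : SUSPICIOUS;
    String[] words = {"joyce", "wielce", "piwko", "royce", "pip", "xyz"};
    for (String word : words) {
      System.out.println(word + " -> " + guardedStem(word, standIn, 1));
    }
  }
}
```

With the real stemmer, the stand-in would be replaced by a small lambda that calls stemmer.stem(word) and converts the result to a String, passing a null result through so the guard returns the original word. This only hides the symptom; it does not explain why the table produces "ąć" in the first place.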
