lucene-dev mailing list archives

From "Adrien Gallou (JIRA)" <j...@apache.org>
Subject [jira] [Created] (LUCENE-8937) Avoid aggressive stemming on numbers in the FrenchMinimalStemmer
Date Sun, 28 Jul 2019 19:45:00 GMT
Adrien Gallou created LUCENE-8937:
-------------------------------------

             Summary: Avoid aggressive stemming on numbers in the FrenchMinimalStemmer
                 Key: LUCENE-8937
                 URL: https://issues.apache.org/jira/browse/LUCENE-8937
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Adrien Gallou


Here is the discussion on the mailing list: http://mail-archives.apache.org/mod_mbox/lucene-java-user/201907.mbox/browser

The light stemmer removes the last character of a word if the last two
characters are identical.
We can see that here:
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263
In this light stemmer, there is a check to avoid altering the token if the
token is a number.

The minimal stemmer also removes the last character of a word if the last
two characters are identical.
We can see that here:
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77

But the minimal stemmer does not check whether that character is a
letter.
As a result, numeric tokens whose last two characters are identical are
altered.

For example "1234567899" will be stemmed as "123456789".

It would be great if such numeric tokens were left unaltered.
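
To make the behaviour concrete, here is a minimal sketch of the trailing-duplicate rule in isolation. It is not the Lucene source: the class and method names are hypothetical, only the "last two characters identical" step is modelled, and the letter check shown in the second method is an assumed guard rather than a confirmed fix.

```java
// Illustrative sketch only: these helpers are hypothetical, not Lucene code.
class TrailingDuplicateSketch {

    // Models the rule as described for the minimal stemmer: drop the last
    // character whenever the last two characters are identical.
    static int stripUnchecked(char[] s, int len) {
        if (len > 1 && s[len - 1] == s[len - 2]) {
            len--;
        }
        return len;
    }

    // Same rule, applied only when the trailing character is a letter
    // (an assumed guard, so numeric tokens pass through unchanged).
    static int stripLettersOnly(char[] s, int len) {
        if (len > 1 && s[len - 1] == s[len - 2] && Character.isLetter(s[len - 1])) {
            len--;
        }
        return len;
    }

    // Convenience wrappers that work on strings instead of char buffers.
    static String unchecked(String token) {
        char[] s = token.toCharArray();
        return new String(s, 0, stripUnchecked(s, s.length));
    }

    static String lettersOnly(String token) {
        char[] s = token.toCharArray();
        return new String(s, 0, stripLettersOnly(s, s.length));
    }

    public static void main(String[] args) {
        System.out.println(unchecked("1234567899"));   // 123456789
        System.out.println(lettersOnly("1234567899")); // 1234567899
        System.out.println(lettersOnly("personn"));    // person
    }
}
```

With the guard in place, a token ending in a doubled letter is still shortened, while "1234567899" is returned as is.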



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

