lucene-java-user mailing list archives

From Tomoko Uchida <tomoko.uchida.1...@gmail.com>
Subject Re: Question about the light and minimal French stemmers
Date Sun, 28 Jul 2019 02:35:56 GMT
Let me clarify a bit.
I think the concern here is that FrenchMinimalStemmer removes the
last digit from a numeric token because it does not check whether the
character is a letter.
For example, "123455" is trimmed to "12345" by FrenchMinimalStemmer.

To me, this behaviour is beyond stemming.
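
To make the difference concrete, here is a minimal sketch of the doubled-character rule in isolation. It is not the Lucene source: the class and method names are hypothetical, and the other rules the real stemmers apply are omitted. It only contrasts an unchecked comparison with one guarded by Character.isLetter, which is the kind of check being discussed.

```java
public class DoubledSuffixSketch {

  // Hypothetical helper (not Lucene code): drops the last character whenever
  // the final two characters are identical, with no check on character type.
  public static int trimUnchecked(char[] s, int len) {
    if (len >= 2 && s[len - 1] == s[len - 2]) {
      return len - 1;
    }
    return len;
  }

  // Same rule, but only applied when the repeated character is a letter,
  // so tokens ending in repeated digits keep their length.
  public static int trimLettersOnly(char[] s, int len) {
    if (len >= 2 && s[len - 1] == s[len - 2] && Character.isLetter(s[len - 1])) {
      return len - 1;
    }
    return len;
  }

  // Convenience wrapper for trying the two variants on a String token.
  public static String apply(String token, boolean lettersOnly) {
    char[] s = token.toCharArray();
    int newLen = lettersOnly ? trimLettersOnly(s, s.length) : trimUnchecked(s, s.length);
    return new String(s, 0, newLen);
  }

  public static void main(String[] args) {
    System.out.println(apply("123455", false)); // prints 12345
    System.out.println(apply("123455", true));  // prints 123455
    System.out.println(apply("mess", true));    // prints mes
  }
}
```

With the letter check, the numeric token is left alone while a word ending in a doubled letter is still shortened.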

Tomoko

On Sun, Jul 28, 2019, 4:55 Michael Sokolov <msokolov@gmail.com> wrote:
>
> I'm not so sure. I think the whole idea of having both stemmers is that the
> minimal one does less than the light one.
>
> Removing the final character of a double letter suffix is going to
> sacrifice some precision. For example mes/mess, ne/née, I'm sure there are
> others.
>
> So having both options is helpful, and I don't think it's a bug on the
> face of it. However, I didn't look closely at the code, so I'm not sure
> what the intent is exactly.
>
> On Sat, Jul 27, 2019, 7:30 AM Tomoko Uchida <tomoko.uchida.1111@gmail.com>
> wrote:
>
> > Hi Adrien,
> >
> > To me, it simply sounds like a bug. Can you please open a JIRA issue
> > (with a patch if possible)?
> >
> > Tomoko
> >
> > On Tue, Jul 23, 2019, 22:05 Adrien Gallou <adriengallou@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I'm using both light and minimal French stemmers and encountered an issue
> > > when using the minimal stemmer.
> > >
> > > The light stemmer removes the last character of a word if the last two
> > > characters are identical.
> > > We can see that here:
> > > https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263
> > > In this light stemmer, there is a check to avoid altering the token if
> > > the token is a number.
> > >
> > > The minimal stemmer also removes the last character of a word if the last
> > > two characters are identical.
> > > We can see that here:
> > > https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77
> > >
> > > But in this minimal stemmer there is no check to see whether the
> > > character is a letter or not.
> > > So numeric tokens whose last two characters are identical are altered.
> > >
> > > Is there a reason for this?
> > > Should I file an issue on Jira to add this check?
> > >
> > > Thanks,
> > >
> > > Adrien Gallou
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
