lucene-dev mailing list archives

From "Massimo Pasquini (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-6138) ItalianLightStemmer doesn't apply on words shorter then 6 chars in length
Date Sun, 28 Dec 2014 15:24:13 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259655#comment-14259655 ]

Massimo Pasquini commented on LUCENE-6138:
------------------------------------------

The issue you pointed out concerns a different stemmer, the one for the Russian language. I
see no connection to the Italian light stemmer. Given the rules of Italian grammar, I think
the bug I found can be fixed (which, from what I read in the other post, may not be possible
in the Russian stemmer).

So I suppose the ItalianLightStemmer can evolve into a better implementation: it should be
possible to find some simple rules on word suffixes to decide whether to apply stemming to
short words (shorter than 6 characters).
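
To make the idea concrete, here is a minimal sketch of what such a suffix rule might look
like. It is only an illustration under assumptions: the class ShortWordSuffixSketch and its
stem method are hypothetical names, not part of Lucene, and the rule covers only the
-ga/-ghe and -ca/-che pairs on words of 4 or 5 characters. It has not been checked against
a word list, so side effects on other short words are likely.

```java
// Hypothetical sketch, NOT Lucene code: one possible suffix rule for short Italian words.
public final class ShortWordSuffixSketch {

    private ShortWordSuffixSketch() {}

    /**
     * Strips -ga/-ghe and -ca/-che endings from words of 4 or 5 characters.
     * Longer words are returned unchanged here, on the assumption that the
     * existing stemmer already handles them.
     */
    public static String stem(String word) {
        int len = word.length();
        if (len < 4 || len >= 6) {
            return word; // out of scope for this sketch
        }
        // Plural forms: "alghe" -> "alg", drop the trailing "he".
        if (word.endsWith("ghe") || word.endsWith("che")) {
            return word.substring(0, len - 2);
        }
        // Singular forms: "alga" -> "alg", drop the trailing "a".
        if (word.endsWith("ga") || word.endsWith("ca")) {
            return word.substring(0, len - 1);
        }
        return word;
    }

    public static void main(String[] args) {
        String[] samples = {"alga", "alghe", "fuga", "fughe", "lega", "leghe", "vanga"};
        for (String s : samples) {
            System.out.println(s + " -> " + stem(s));
        }
    }
}
```

With this rule the singular and plural forms from the report reduce to the same stem (alga
and alghe both give alg), and vanga gives vang, the same result reported for vanghe. A word
such as "anche" would also be cut to "anc", which is the kind of side effect that would
need study before adopting anything like this.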

Note that my thoughts are not based on a deep and accurate study of the problem, but I think
it is worth investigating. I suggest submitting this issue to the author of the code to see
if he has a better solution (I saw he works in the field of computational linguistics).
According to the notes in the source, the algorithm was written in 2005 as the result of some
experiments. We don't know whether the authors have put further effort into the problem, or
whether they have since written a better algorithm that they would agree to publish under
Lucene's license.

I don't expect the stemmer to be 100% successful, but the issue I pointed out affects an
important range of words.

> ItalianLightStemmer doesn't apply on words shorter then 6 chars in length
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-6138
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6138
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.10.2
>            Reporter: Massimo Pasquini
>            Priority: Minor
>
> I expect a stemmer to reduce the singular and plural forms of a noun to a shorter common
> form. The implementation of the ItalianLightStemmer doesn't apply any stemming to words
> shorter than 6 characters in length. This leads to some annoying results:
> singular form | plural form
> 4|5 chars in length (no stemming)
> alga -> alga | alghe -> alghe
> fuga -> fuga | fughe -> fughe
> lega -> lega | leghe -> leghe
> 5|6 chars in length (stemming only on plural form)
> vanga -> vanga | vanghe -> vang
> verga -> verga | verghe -> verg
> I suppose that this limit on word length is there to avoid side effects on other short
> words not in the set above, but I think something must be reviewed in the code for better
> results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

