lucene-dev mailing list archives

From "Lukas Vlcek (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4542) Make RECURSION_CAP in HunspellStemmer configurable
Date Wed, 18 Sep 2013 12:30:55 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13770721#comment-13770721 ]

Lukas Vlcek commented on LUCENE-4542:
-------------------------------------

IIRC the hunspell stemmer basically works the following way:

1. Assuming the input token is not the root form of the word, it scans the affix rules (.aff file)
and tries to identify the rules that could have been used to produce the input token.
2. It applies each rule found to the input token to get one or more output tokens. The output
tokens can be considered candidates for the word in root form.
3. If any of the candidates is found in the dictionary (.dic file) and application of the particular
rule is allowed (see the regexp pattern in the .aff file), then bingo! If not, go to #1 until the
RECURSION_CAP level is reached.

This way you can have `nongoodnesses` stemmed to `good` (provided RECURSION_CAP=2). Depending
on the dictionary and affix rules, you may need one pass to get from `nongoodnesses` to `goodnesses`
and then two more passes to get from `goodnesses` to `goodness` and then from `goodness`
to `good`. (Probably not the best example.)
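
To make the loop above concrete, here is a toy sketch in Java. It is not the actual HunspellStemmer code: the class ToyAffixStemmer, its Rule type, and the three hard-coded affixes are all made up for illustration, and the regexp condition check from the .aff file is left out. The sketch assumes that depth is counted from 0, so a cap of 2 allows up to three strippings.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy illustration only; all names here are hypothetical, not Lucene API.
class ToyAffixStemmer {
    // A rule that strips a fixed prefix or suffix (no .aff condition patterns).
    static final class Rule {
        final String affix;
        final boolean isPrefix;
        Rule(String affix, boolean isPrefix) { this.affix = affix; this.isPrefix = isPrefix; }

        boolean matches(String word) {
            if (word.length() <= affix.length()) return false;
            return isPrefix ? word.startsWith(affix) : word.endsWith(affix);
        }

        String strip(String word) {
            return isPrefix ? word.substring(affix.length())
                            : word.substring(0, word.length() - affix.length());
        }
    }

    private final Set<String> dictionary;
    private final List<Rule> rules;
    private final int recursionCap;

    ToyAffixStemmer(Set<String> dictionary, List<Rule> rules, int recursionCap) {
        this.dictionary = dictionary;
        this.rules = rules;
        this.recursionCap = recursionCap;
    }

    // Assumed toy data: one dictionary entry and three affix rules.
    static ToyAffixStemmer demo(int recursionCap) {
        List<Rule> rules = List.of(
            new Rule("non", true), new Rule("es", false), new Rule("ness", false));
        return new ToyAffixStemmer(Set.of("good"), rules, recursionCap);
    }

    List<String> stem(String word) {
        Set<String> stems = new LinkedHashSet<>();
        stem(word, 0, stems);
        return new ArrayList<>(stems);
    }

    private void stem(String word, int depth, Set<String> stems) {
        for (Rule rule : rules) {
            if (!rule.matches(word)) continue;        // step 1: rule could have produced the token
            String candidate = rule.strip(word);      // step 2: undo the rule to get a candidate
            if (dictionary.contains(candidate)) {
                stems.add(candidate);                 // step 3: candidate is a known root
            } else if (depth < recursionCap) {
                stem(candidate, depth + 1, stems);    // otherwise try again, one level deeper
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(2).stem("nongoodnesses"));
        System.out.println(demo(1).stem("nongoodnesses"));
    }
}
```

With this toy data, a cap of 2 gets from `nongoodnesses` to `good`, while a cap of 1 finds nothing because it gives up one stripping too early. Every extra level multiplies the number of candidates tried, which is where the cost of a higher cap comes from.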

However, this all depends heavily on the particular dictionary and affix rules.

For example, I realized that the Czech (ispell) and Slovak (hunspell) dictionaries are constructed
in a different way (though still a way that feels natural to the language itself), and
a single pass works best for them (although a single pass does not allow handling both a prefix
AND a suffix at the same time).

In my opinion there is a lot that could be improved in the hunspell token filter, but it is
more a linguistic matter than an algorithmic one.
                
> Make RECURSION_CAP in HunspellStemmer configurable
> --------------------------------------------------
>
>                 Key: LUCENE-4542
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4542
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 4.0
>            Reporter: Piotr
>            Assignee: Steve Rowe
>             Fix For: 5.0, 4.4
>
>         Attachments: Lucene-4542-javadoc.patch, LUCENE-4542.patch, LUCENE-4542-with-solr.patch
>
>
> Currently there is
> private static final int RECURSION_CAP = 2;
> in the code of the class HunspellStemmer. It makes hunspell almost unusable with several
> dictionaries due to bad performance (e.g. it costs 36 ms to stem a long sentence in Latvian
> with recursion_cap=2 and 5 ms with recursion_cap=1). It would be nice to be able to tune this
> number as needed.
> AFAIK this number (2) was chosen arbitrarily.
> (This is the first issue of my life, so please forgive any mistakes.)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

