lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5484) Distinct control of recursion levels for prefix and suffix in Hunspell.
Date Sun, 02 Mar 2014 11:31:21 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13917369#comment-13917369 ]

Robert Muir commented on LUCENE-5484:
-------------------------------------

Just to explain the problem a bit more (for example Czech).

You should never recurse unless the affix actually allows for it with a continuation class.
The Czech ones don't! So it should never strip more than one suffix there. And even with those,
it's max 1 prefix, max 2 suffixes, unless COMPLEXPREFIXES is specified, in which case it's the
other way around (max 2 prefixes, max 1 suffix).
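
To make the rule above concrete, here is a minimal sketch in Python. It is not the Lucene or Hunspell implementation: the names `affix_limits` and `may_strip_more` are hypothetical, and the model is deliberately simplified (it only tracks how many affixes of one kind were already stripped and whether the last stripped affix carries a continuation class; it ignores cross-product flags and which flags the continuation class contains).

```python
def affix_limits(complex_prefixes=False):
    """Return (max_prefixes, max_suffixes) as described in the comment.

    Hypothetical helper: default is max 1 prefix and max 2 suffixes;
    COMPLEXPREFIXES swaps the two limits.
    """
    return (2, 1) if complex_prefixes else (1, 2)


def may_strip_more(kind, already_stripped, last_has_continuation,
                   complex_prefixes=False):
    """Decide whether one more affix of `kind` may be stripped.

    kind: "prefix" or "suffix".
    already_stripped: number of affixes of this kind stripped so far.
    last_has_continuation: True if the affix stripped last declares a
        continuation class (assumed to be known by the caller).
    """
    if kind not in ("prefix", "suffix"):
        raise ValueError("kind must be 'prefix' or 'suffix'")
    max_prefixes, max_suffixes = affix_limits(complex_prefixes)
    limit = max_prefixes if kind == "prefix" else max_suffixes
    if already_stripped == 0:
        # The first affix of a kind needs no continuation class.
        return True
    # Recursing is allowed only via a continuation class, and only
    # while the hard limit for this kind is not yet reached.
    return last_has_continuation and already_stripped < limit
```

Under this model, a dictionary whose suffixes have no continuation classes (as stated for Czech) never gets a second suffix stripped: `may_strip_more("suffix", 1, False)` is `False`.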

That's why I say the parameter is no longer needed. For your Czech dictionary (and many others
now), you won't find any differences between 'hunspell -m' and the code in trunk now.

> Distinct control of recursion levels for prefix and suffix in Hunspell.
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-5484
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5484
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Lukas Vlcek
>            Priority: Minor
>
> Currently, there is an option to set the recursionCap value to control the depth of recursion
> in the Hunspell token filter. This recursion applies an allowed affix rule to the input token
> and passes the output token(s) back in as input tokens recursively.
> However, recursionCap does not distinguish between how many prefix and suffix rules were
> applied; it only counts the total. For example, if recursionCap is set to 1, it actually
> includes all of the following options:
> - 2 prefix rules, 0 suffix rules
> - 1 prefix rule, 1 suffix rule
> - 0 prefix rules, 2 suffix rules
> In some cases it is necessary to distinguish between prefix rules and suffix rules and to
> have finer control over how many times each is applied. The requested feature should allow
> setting the recursion level separately for prefix and suffix rules.
> A specific example is the Czech dictionary, which gives the best results if suffix rules
> are applied only once, hence recursionCap = 0. But if a prefix rule is applied to the input
> token, that setting does not allow a suffix rule to be applied as well, and it produces a
> token that is not in root form. Setting recursionCap = 1 produces so many irrelevant tokens
> that it makes the Hunspell token filter useless. A good solution to this problem would be to
> tell the Hunspell token filter to apply up to 1 prefix rule and up to 1 suffix rule only
> (meaning never allow 0 prefix rules and 2 suffix rules).
> Generally, this probably depends a lot on how the particular dictionary and affix rules
> are constructed, and it might be considered not a generalization but rather an expert feature.
> (There was some relevant discussion going on in LUCENE-5468)
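
The difference between a single total cap and separate per-kind caps, as described in the issue, can be illustrated with a small Python sketch. The function names are hypothetical and this is not the token filter's API. It assumes, following the issue text, that recursionCap = 0 permits one applied rule in total and recursionCap = 1 permits two.

```python
def deepest_combos_total_cap(recursion_cap):
    """(prefix_rules, suffix_rules) pairs at the maximum depth allowed
    by a single total cap, assuming cap + 1 rules may be applied."""
    total = recursion_cap + 1
    return [(p, total - p) for p in range(total, -1, -1)]


def combos_separate_caps(max_prefix, max_suffix):
    """All (prefix_rules, suffix_rules) pairs allowed when prefix and
    suffix rules are capped independently, as the issue requests."""
    return [(p, s)
            for p in range(max_prefix + 1)
            for s in range(max_suffix + 1)]


# With recursionCap = 1 the total cap admits the three options listed
# in the issue, including the unwanted "0 prefix, 2 suffix" case.
print(deepest_combos_total_cap(1))

# Separate caps of 1 and 1 keep "1 prefix, 1 suffix" but exclude it.
print((0, 2) in combos_separate_caps(1, 1))
```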



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
