lucene-dev mailing list archives

From "Andriy Rysin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7348) Add dynamic stemmer for Ukrainian
Date Tue, 21 Jun 2016 03:52:57 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341014#comment-15341014 ]

Andriy Rysin commented on LUCENE-7348:
--------------------------------------

[~mikemccand] Hey Michael,
I've analyzed the inflection rules we have in the dict_uk project (https://github.com/arysin/dict_uk).
It has ~4500 inflection rules (most are simple matches, but some are regexps), and those
rules cover almost all possible affixes. I can probably drop the rare and homonymous ones to get
below 4k, but then the question is where to go next:
1) Having all the rules would be nice, as it would provide high accuracy and a high level of compatibility
with the dictionary-based lemmatizer created in LUCENE-7287 (we could probably even make a
hybrid solution).
2) Having a smaller/simpler rule set would benefit performance (but to simplify it properly we would
have to analyze the frequency/importance of each rule).
3) Is lemmatizing analysis acceptable, or is stemming preferred? For real stemming we would have
to work more on the rules to find the (pseudo)roots for each inflection rule.
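
To make the three options concrete, here is a minimal sketch of how a suffix-rule lemmatizer, a dictionary fallback (the "hybrid" idea) and a plain stemmer could differ. Everything in it is an assumption made for illustration: the rules in RULES and the DICTIONARY entry are hypothetical toy data, not taken from dict_uk, and none of the function names correspond to a Lucene API.

```python
# Hypothetical toy rules: (inflected suffix, lemma suffix). Not taken from dict_uk.
RULES = [
    ("ою", "а"),   # книгою  -> книга
    ("ами", "а"),  # книгами -> книга
    ("и", "а"),    # книги   -> книга
]

# Hypothetical stand-in for the dictionary-based lemmatizer.
DICTIONARY = {"міста": "місто"}


def match_rule(word):
    """Return the rule with the longest matching suffix, or None."""
    best = None
    for suffix, replacement in RULES:
        # Require at least one character to remain before the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            if best is None or len(suffix) > len(best[0]):
                best = (suffix, replacement)
    return best


def lemmatize(word):
    """Hybrid lookup: dictionary first, then fall back to the suffix rules."""
    if word in DICTIONARY:
        return DICTIONARY[word]
    rule = match_rule(word)
    if rule is None:
        return word
    suffix, replacement = rule
    return word[: -len(suffix)] + replacement


def stem(word):
    """Strip the matched suffix without restoring a lemma ending."""
    rule = match_rule(word)
    if rule is None:
        return word
    return word[: -len(rule[0])]
```

With these toy rules, lemmatize("книгами") picks the longer suffix "ами" over "и", while stem("книга") leaves the word unchanged because no rule covers the lemma ending itself. That gap is the kind of extra rule work item 3 refers to.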

I looked at the existing light stemmers, and many are very basic. It looks like we're going
in reverse, and I am trying to understand whether, already having a complex solution, we want to make
it simpler (it looks like the only benefit would be performance). I also tried to google
how to do stemming "right", but nothing serious jumped out at me, especially nothing applicable to
Slavic languages.

Thanks.


> Add dynamic stemmer for Ukrainian
> ---------------------------------
>
>                 Key: LUCENE-7348
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7348
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Andriy Rysin
>            Priority: Minor
>              Labels: analysis, language
>
> We're adding a dictionary-based lemmatizing analyzer for Ukrainian in https://issues.apache.org/jira/browse/LUCENE-7287.
> It would be nice to have a dynamic stemmer that can handle words that are not in the dictionary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


