lucene-dev mailing list archives

From "Lukas Vlcek (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5057) Hunspell stemmer generates multiple tokens
Date Thu, 05 Sep 2013 09:18:52 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13758927#comment-13758927 ]

Lukas Vlcek commented on LUCENE-5057:
-------------------------------------

I am not a linguistics expert, but here are my thoughts. Generally, with highly inflected languages
and short words there is a high chance that some word form of root word A will be identical to
some word form of root word B. The shorter the word and the more inflected the language,
the higher the chance. Isn't this true? I can give you an example from the Czech language:

Take the word "den". When you run this word through the Hunspell token filter (with recursion level 0)
using the Czech dictionary (you can find it in the attachments of #LUCENE-4311), it outputs three different
tokens:

[ "den", "dno", "dna" ]

Where
 - "den" is the nominative singular [1] of "a day". Thus the output is "den".
 - "den" is the genitive plural [2] of "a bottom" or "a base". Thus the output is "dno".
 - "den" is the genitive plural of "gout" (the disease). Thus the output is "dna".

I do not see this as a dictionary issue (on the contrary, I would argue that the affix rules did a very
good job). When you get the token "den" without any context, you really do not know which of
these three meanings it has.
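
To illustrate the point, here is a minimal sketch of why a context-free stemmer has to return several stems for "den". It does not use Hunspell or the real Czech dictionary; the TOY_ANALYSES table and the stems() helper are hypothetical names made up for this example, and the table holds only the three readings listed above.

```python
from typing import Dict, List, Tuple

# Hypothetical toy table, NOT the Hunspell API or the real Czech dictionary.
# It maps a surface form to every (stem, grammatical reading) pair it can have.
TOY_ANALYSES: Dict[str, List[Tuple[str, str]]] = {
    "den": [
        ("den", "nominative singular"),
        ("dno", "genitive plural"),
        ("dna", "genitive plural"),
    ],
}


def stems(surface: str) -> List[str]:
    """Return every candidate stem for a surface form, in table order."""
    analyses = TOY_ANALYSES.get(surface, [])
    # Fall back to the surface form itself when the toy table has no entry.
    return [stem for stem, _reading in analyses] or [surface]


print(stems("den"))  # ['den', 'dno', 'dna']
```

Without surrounding context there is nothing in the input that would let the lookup prefer one of the three stems, so returning all of them is the only lossless answer.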

You can find another example (including Elasticsearch queries) at the very bottom of my
article [3]. In Elasticsearch terminology, the "match query" does not work correctly, while the "query
string" query seems to be doing fine.

Let me know if you have any further questions.

[1] http://en.wikipedia.org/wiki/Nominative_case
[2] http://en.wikipedia.org/wiki/Genitive
[3] http://www.zdrojak.cz/clanky/elasticsearch-vyhledavame-hezky-cesky-ii-a-taky-slovensky/
                
> Hunspell stemmer generates multiple tokens
> ------------------------------------------
>
>                 Key: LUCENE-5057
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5057
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 4.3
>            Reporter: Luca Cavanna
>            Assignee: Adrien Grand
>
> The hunspell stemmer seems to be generating multiple tokens: the original token plus
> the available stems.
> It might be a good thing in some cases but it seems to be a different behaviour compared
> to the other stemmers and causes problems as well. I would rather have an option to decide
> whether it should output only the available stems, or the stems plus the original token. I'm
> not sure though if it's possible to have only a single stem indexed, which would be even better
> in my opinion. When I look at how snowball works only one token is indexed, the stem, and
> that works great. Probably there's something I'm missing in how hunspell works.
> Here is my issue: I have a query composed of multiple terms, which is analyzed using
> stemming and a boolean query is generated out of it. All fine when adding all clauses as should
> (OR operator), but if I add all clauses as must (AND operator), then I can get back only the
> documents that contain the stem originated by exactly the same original word.
> Example for the Dutch language I'm working with: fiets (means bicycle in Dutch), its
> plural is fietsen.
> If I index "fietsen" I get both "fietsen" and "fiets" indexed, but if I index "fiets"
> I get only "fiets" indexed.
> When I query for "fietsen whatever" I get the following boolean query: field:fiets field:fietsen
> field:whatever.
> If I apply the AND operator and use must clauses for each subquery, then I can only find
> the documents that originally contained "fietsen", not the ones that originally contained
> "fiets", which is not really what stemming is about.
> Any thoughts on this? I also wonder if it can be a dictionary issue since I see that
> different words that have the word "fiets" as root don't get the same stems, and using the
> AND operator at query time is a big issue.
> I would love to contribute on this and am looking forward to your feedback.
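
The AND-operator problem described in the quoted issue can be reproduced with a small, self-contained simulation. This is only a sketch under stated assumptions: it does not call Lucene, and TOY_ANALYZER, analyze(), build_index() and search() are hypothetical names. The analyzer table simply hard-codes the behaviour reported above ("fietsen" yields "fietsen" and "fiets", while "fiets" yields only "fiets").

```python
from typing import Dict, List, Set

# Hypothetical analyzer output, hard-coded from the issue description.
# This is NOT Lucene's HunspellStemFilter, only a stand-in for its reported output.
TOY_ANALYZER: Dict[str, List[str]] = {
    "fietsen": ["fietsen", "fiets"],  # original token plus its stem
    "fiets": ["fiets"],               # already the stem, nothing is added
}


def analyze(text: str) -> List[str]:
    """Split on whitespace and expand each word into its emitted tokens."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(TOY_ANALYZER.get(word, [word]))
    return tokens


def build_index(docs: Dict[int, str]) -> Dict[int, Set[str]]:
    """Map each document id to the set of tokens indexed for it."""
    return {doc_id: set(analyze(text)) for doc_id, text in docs.items()}


def search(index: Dict[int, Set[str]], query: str, require_all: bool) -> List[int]:
    """require_all=True mimics must (AND) clauses, False mimics should (OR)."""
    terms = analyze(query)
    match = all if require_all else any
    return sorted(
        doc_id
        for doc_id, tokens in index.items()
        if match(term in tokens for term in terms)
    )


docs = {1: "fietsen whatever", 2: "fiets whatever"}
index = build_index(docs)

print(search(index, "fietsen whatever", require_all=False))  # [1, 2]
print(search(index, "fietsen whatever", require_all=True))   # [1]
```

With AND semantics the query term "fietsen" becomes mandatory, and the document that originally contained "fiets" never had that token indexed, so it drops out of the results.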

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

