lucene-dev mailing list archives

From "Namgyu Kim (Jira)" <>
Subject [jira] [Commented] (LUCENE-8966) KoreanTokenizer should split unknown words on digits
Date Wed, 11 Sep 2019 19:20:00 GMT


Namgyu Kim commented on LUCENE-8966:

Oh, thank you for your reply, [~jim.ferenczi] :D

I checked again and it is not a bug.
The result comes from the Viterbi path.

But I think it needs to be discussed,
so I opened a new issue about it.

I'd appreciate it if you could check LUCENE-8977.

P.S. +1 to your patch

> KoreanTokenizer should split unknown words on digits
> ----------------------------------------------------
>                 Key: LUCENE-8966
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>         Attachments: LUCENE-8966.patch, LUCENE-8966.patch
> The Korean tokenizer groups characters of unknown words if they belong to the same
> script or an inherited one. This is ok for inputs like Мoscow (with a Cyrillic М and
> the rest in Latin), but this rule doesn't work well on digits, since they are considered
> common with other scripts. For instance, the input "44사이즈" is kept as is even though
> "사이즈" is part of the dictionary. We should restore the original behavior and split
> any unknown word if a digit is followed by another
> This issue was first discovered in []
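
The grouping problem described above can be illustrated with a small standalone sketch. It is not the KoreanTokenizer code and not the attached patch; the class `UnknownWordSplitSketch`, the method `group`, and the `splitOnDigits` flag are hypothetical names made up for illustration. It assumes that "same script or an inherited one" means code points whose `Character.UnicodeScript` is COMMON or INHERITED may join any run, and that the digit rule breaks a run whenever a digit is followed by a non-digit.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene code: groups characters into runs by Unicode script.
final class UnknownWordSplitSketch {

  // Splits text into runs. COMMON and INHERITED code points join any run (assumption).
  // If splitOnDigits is true, a run also ends when a digit is followed by a non-digit.
  static List<String> group(String text, boolean splitOnDigits) {
    List<String> runs = new ArrayList<>();
    StringBuilder run = new StringBuilder();
    UnicodeScript runScript = null; // first specific script seen in the current run
    boolean prevDigit = false;
    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      UnicodeScript script = UnicodeScript.of(cp);
      boolean specific =
          script != UnicodeScript.COMMON && script != UnicodeScript.INHERITED;
      boolean digit = Character.isDigit(cp);
      // Break when two different specific scripts meet.
      boolean scriptBreak = specific && runScript != null && script != runScript;
      // Break when a digit is followed by a non-digit (only if the rule is enabled).
      boolean digitBreak = splitOnDigits && prevDigit && !digit;
      if (run.length() > 0 && (scriptBreak || digitBreak)) {
        runs.add(run.toString());
        run.setLength(0);
        runScript = null;
      }
      if (specific && runScript == null) {
        runScript = script;
      }
      run.appendCodePoint(cp);
      prevDigit = digit;
      i += Character.charCount(cp);
    }
    if (run.length() > 0) {
      runs.add(run.toString());
    }
    return runs;
  }

  public static void main(String[] args) {
    // Digits have script COMMON, so without the digit rule they merge with Hangul.
    System.out.println(group("44사이즈", false));
    // With the digit rule, the digits become their own run.
    System.out.println(group("44사이즈", true));
  }
}
```

In this sketch, the first call keeps "44사이즈" as a single run because the digits are COMMON and never conflict with the Hangul script, while the second call yields "44" and "사이즈" as separate runs, which would let the second part be matched against a dictionary.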
