lucene-dev mailing list archives

From "Namgyu Kim (JIRA)" <>
Subject [jira] [Created] (LUCENE-8553) New KoreanDecomposeFilter for KoreanAnalyzer(Nori)
Date Thu, 01 Nov 2018 16:13:00 GMT
Namgyu Kim created LUCENE-8553:

             Summary: New KoreanDecomposeFilter for KoreanAnalyzer(Nori)
                 Key: LUCENE-8553
             Project: Lucene - Core
          Issue Type: New Feature
          Components: modules/analysis
            Reporter: Namgyu Kim

This is a patch for KoreanDecomposeFilter.

This filter decomposes Hangul syllables into their component letters
(e.g. 한글 -> ㅎㄱ or ㅎㅏㄴㄱㅡㄹ).

Hangul input works differently from alphabetic input.

If you want to type "apple" in English,
   you press the keys in the order {color:#FF0000}a -> p -> p -> l -> e{color}.

However, if you want to type "한글" (Hangul) in Hangul,
   you have to press the keys in the order {color:#FF0000}ㅎ -> ㅏ -> ㄴ -> ㄱ -> ㅡ -> ㄹ{color}
   (because of the keyboard layout: each key enters one letter, not a whole syllable block).

As a result, spell checking that compares whole Hangul syllables can be less accurate: a single mistyped key changes an entire syllable.


A Hangul syllable consists of up to three elements: *"Choseong"* (initial consonant), *"Jungseong"* (medial vowel), and *"Jongseong"* (final consonant, optional).

Together, these elements are called *"Jamo"*.

Take the Korean word "된장찌개" (soybean paste stew):
its *"Choseong"* are {color:#FF0000}"ㄷ, ㅈ, ㅉ, ㄱ"{color},
its *"Jungseong"* are {color:#FF0000}"ㅚ, ㅏ, ㅣ, ㅐ"{color},
its *"Jongseong"* are {color:#FF0000}"ㄴ, ㅇ"{color}.
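
To make the Jamo structure concrete, here is a minimal sketch in Python (not the code from the patch). It relies on the standard Unicode layout of precomposed Hangul syllables (U+AC00 to U+D7A3, 19 x 21 x 28 combinations) and emits compatibility Jamo such as "ㄷ". The names split_syllable and to_jamo are made up for this illustration.

```python
# Illustrative sketch only; this is not the code from the LUCENE-8553 patch.
HANGUL_BASE = 0xAC00  # first precomposed syllable
HANGUL_LAST = 0xD7A3  # last precomposed syllable

CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"        # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"   # 21 medial vowels
# 28 finals; index 0 means "no final consonant"
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def split_syllable(ch):
    """Return (choseong, jungseong, jongseong) for one character.

    Characters that are not precomposed Hangul syllables are returned
    unchanged in the first slot.
    """
    code = ord(ch)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        return (ch, "", "")
    index = code - HANGUL_BASE
    cho = CHOSEONG[index // 588]            # 588 = 21 vowels * 28 finals
    jung = JUNGSEONG[(index % 588) // 28]
    jong = JONGSEONG[index % 28]
    return (cho, jung, jong)


def to_jamo(text):
    """Concatenate the Jamo of every character in text."""
    return "".join("".join(split_syllable(ch)) for ch in text)
```

With this sketch, to_jamo("한글") gives "ㅎㅏㄴㄱㅡㄹ", which is the key order described earlier.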

Jamo decomposition is needed for spell checking, as explained above.

A "Choseong Filter" is also needed because many Koreans use *"Choseong Search"*, that is, searching by initial consonants only
(especially on mobile devices).
If you want to search for "된장찌개" you have to type 10 Jamo, which is quite a lot.
For that reason, I think it would be useful to provide a filter that makes "된장찌개" searchable by "ㄷㅈㅉㄱ".
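
As a rough illustration of Choseong Search (a sketch, not the filter from the patch), the Python below reduces each title to its initial consonants and matches a query against that key. The function names and the prefix-matching rule are assumptions made for this example; a real setup would index the filter output in Lucene instead.

```python
# Illustrative sketch only; names and matching rule are assumptions.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"  # 19 initial consonants
HANGUL_BASE = 0xAC00
HANGUL_LAST = 0xD7A3


def choseong_key(text):
    """Replace every precomposed Hangul syllable with its initial consonant."""
    out = []
    for ch in text:
        code = ord(ch)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            # 588 = 21 medial vowels * 28 finals per initial consonant
            out.append(CHOSEONG[(code - HANGUL_BASE) // 588])
        else:
            out.append(ch)  # keep non-Hangul characters as they are
    return "".join(out)


def choseong_search(query, titles):
    """Return the titles whose Choseong key starts with the query."""
    return [t for t in titles if choseong_key(t).startswith(query)]
```

For example, choseong_search("ㄷㅈㅉㄱ", ["된장찌개", "김치찌개"]) keeps only "된장찌개", so 4 typed Jamo are enough instead of 10.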

Hangul also has *dual chars* (double consonants and compound vowels), such as
"ㄲ, ㄸ, ㅃ, ㅆ, ㅉ, ㅚ (ㅗ + ㅣ), ㅢ (ㅡ + ㅣ), ...".
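
Splitting dual chars can be sketched as a simple lookup table. The Python below is only an illustration: the table covers the five double consonants and the seven compound vowels of modern Hangul, and whether the patch also splits compound final consonants (such as ㄳ) is not stated here, so they are left untouched. The names DUAL_PARTS and split_dual are hypothetical.

```python
# Illustrative sketch only; the exact table used by the patch is not shown
# in this message, so this mapping is an assumption.
DUAL_PARTS = {
    # double consonants
    "ㄲ": "ㄱㄱ", "ㄸ": "ㄷㄷ", "ㅃ": "ㅂㅂ", "ㅆ": "ㅅㅅ", "ㅉ": "ㅈㅈ",
    # compound vowels
    "ㅘ": "ㅗㅏ", "ㅙ": "ㅗㅐ", "ㅚ": "ㅗㅣ",
    "ㅝ": "ㅜㅓ", "ㅞ": "ㅜㅔ", "ㅟ": "ㅜㅣ",
    "ㅢ": "ㅡㅣ",
}


def split_dual(jamo):
    """Replace each dual char in a Jamo string with its two components."""
    return "".join(DUAL_PARTS.get(j, j) for j in jamo)
```

For example, split_dual("ㅉㅣㄱㅐ") gives "ㅈㅈㅣㄱㅐ".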

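For these reasons,
KoreanDecomposeFilter offers *5 options*.

ex) *된장찌개* => tokens [된장], [찌개]

Output of each option:

[된장], [찌개]  (original tokens, no decomposition)

[ㄷㅈ], [ㅉㄱ]  (Choseong only)

[ㄷㅈ], [ㅈㅈㄱ]  (Choseong only, dual chars split)

[ㄷㅚㄴㅈㅏㅇ], [ㅉㅣㄱㅐ]  (all Jamo)

[ㄷㅗㅣㄴㅈㅏㅇ], [ㅈㅈㅣㄱㅐ]  (all Jamo, dual chars split)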
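
The five outputs above can be reproduced with a short Python sketch. This is not the patch itself: the mode names ("original", "choseong", "choseong_split", "jamo", "jamo_split") are invented for this example, and the dual-char table is an assumption that leaves compound final consonants untouched.

```python
# Illustrative sketch only; mode names and the dual-char table are assumptions.
BASE, LAST = 0xAC00, 0xD7A3  # range of precomposed Hangul syllables
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")
DUAL = {
    "ㄲ": "ㄱㄱ", "ㄸ": "ㄷㄷ", "ㅃ": "ㅂㅂ", "ㅆ": "ㅅㅅ", "ㅉ": "ㅈㅈ",
    "ㅘ": "ㅗㅏ", "ㅙ": "ㅗㅐ", "ㅚ": "ㅗㅣ", "ㅝ": "ㅜㅓ", "ㅞ": "ㅜㅔ",
    "ㅟ": "ㅜㅣ", "ㅢ": "ㅡㅣ",
}


def _parts(ch):
    """Split one syllable into (choseong, jungseong, jongseong)."""
    code = ord(ch)
    if not BASE <= code <= LAST:
        return (ch, "", "")  # non-Hangul characters pass through
    index = code - BASE
    return (CHO[index // 588], JUNG[(index % 588) // 28], JONG[index % 28])


def decompose(token, mode):
    """Decompose one token according to one of the five (hypothetical) modes."""
    if mode == "original":
        return token
    pieces = []
    for ch in token:
        cho, jung, jong = _parts(ch)
        if mode.startswith("choseong"):
            pieces.append(cho)                # initial consonant only
        else:
            pieces.append(cho + jung + jong)  # every Jamo
    text = "".join(pieces)
    if mode.endswith("_split"):
        text = "".join(DUAL.get(j, j) for j in text)  # split dual chars
    return text


tokens = ["된장", "찌개"]
for mode in ("original", "choseong", "choseong_split", "jamo", "jamo_split"):
    print(mode, [decompose(t, mode) for t in tokens])
```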

