lucene-dev mailing list archives

From "Namgyu Kim (JIRA)" <j...@apache.org>
Subject [jira] [Created] (LUCENE-8553) New KoreanDecomposeFilter for KoreanAnalyzer(Nori)
Date Thu, 01 Nov 2018 16:13:00 GMT
Namgyu Kim created LUCENE-8553:
----------------------------------

             Summary: New KoreanDecomposeFilter for KoreanAnalyzer(Nori)
                 Key: LUCENE-8553
                 URL: https://issues.apache.org/jira/browse/LUCENE-8553
             Project: Lucene - Core
          Issue Type: New Feature
          Components: modules/analysis
            Reporter: Namgyu Kim


This is a patch for KoreanDecomposeFilter.

This filter can be used to decompose Hangul.
(e.g., 한글 -> ㅎㄱ or ㅎㅏㄴㄱㅡㄹ)

Typing Hangul works differently from typing English.

To type "apple" in English,
   you press the keys in the order {color:#FF0000}a -> p -> p -> l -> e{color}.

However, to type "한글" (Hangul) in Hangul,
   you press the keys in the order {color:#FF0000}ㅎ -> ㅏ -> ㄴ -> ㄱ -> ㅡ -> ㄹ{color},
   because the Korean keyboard layout has keys for individual Jamo, not for whole syllables.

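This means that spell checking on fully composed Hangul syllables can be less accurate than spell checking on decomposed Jamo, because a single mistyped key changes a whole syllable.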
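
As a rough illustration of this point (not code from the patch), the sketch below compares plain Levenshtein edit distance on composed syllables and on decomposed Jamo. The class name, the method names, and the use of the standard Unicode Hangul syllable arithmetic for decomposition are my assumptions for the example.

```java
final class JamoSpellCheckSketch {
    // Compatibility Jamo in Unicode syllable-composition order.
    private static final String CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ";
    private static final String JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ";
    // Index 0 means "no Jongseong"; the leading space is only a placeholder.
    private static final String JONGSEONG = " ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ";

    /** Replaces each precomposed syllable (U+AC00..U+D7A3) with its Jamo. */
    static String toJamo(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0xAC00 || c > 0xD7A3) {
                out.append(c);
                continue;
            }
            int index = c - 0xAC00;
            out.append(CHOSEONG.charAt(index / (21 * 28)));
            out.append(JUNGSEONG.charAt((index % (21 * 28)) / 28));
            if (index % 28 != 0) {
                out.append(JONGSEONG.charAt(index % 28));
            }
        }
        return out.toString();
    }

    /** Classic Levenshtein distance using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // "한굴" is a one-key typo of "한글"; "한강" is a different word.
        System.out.println(editDistance("한글", "한굴"));                 // 1
        System.out.println(editDistance("한글", "한강"));                 // 1
        System.out.println(editDistance(toJamo("한글"), toJamo("한굴"))); // 1
        System.out.println(editDistance(toJamo("한글"), toJamo("한강"))); // 2
    }
}
```

On composed syllables the typo and the unrelated word are equally far from "한글", while on Jamo the typo is closer, which is the kind of difference a spell checker can use.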

 

A Hangul syllable consists of up to three elements: *"Choseong"* (initial consonant), *"Jungseong"* (medial vowel), and *"Jongseong"* (optional final consonant).

These three elements are collectively called *"Jamo"*.

For example, in the Korean word "된장찌개" (soybean paste stew),
the *"Choseong"* are {color:#FF0000}"ㄷ, ㅈ, ㅉ, ㄱ"{color},
the *"Jungseong"* are {color:#FF0000}"ㅚ, ㅏ, ㅣ, ㅐ"{color},
and the *"Jongseong"* are {color:#FF0000}"ㄴ, ㅇ"{color}.
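
The split into Choseong, Jungseong, and Jongseong can be computed from the code point of a precomposed syllable. The following is a minimal sketch of that arithmetic, assuming the standard Unicode Hangul syllable block (U+AC00 to U+D7A3); the class and method names are hypothetical and this is not the implementation in the patch.

```java
final class HangulJamoSketch {
    // Compatibility Jamo in Unicode syllable-composition order:
    // 19 Choseong, 21 Jungseong, 27 Jongseong plus "none" at index 0.
    private static final String CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ";
    private static final String JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ";
    // The leading space is only a placeholder for "no Jongseong".
    private static final String JONGSEONG = " ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ";

    /** Replaces each precomposed syllable with its Jamo; other characters pass through. */
    static String decompose(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0xAC00 || c > 0xD7A3) {
                out.append(c);
                continue;
            }
            int index = c - 0xAC00;
            int cho = index / (21 * 28);         // initial consonant
            int jung = (index % (21 * 28)) / 28; // medial vowel
            int jong = index % 28;               // final consonant, 0 if absent
            out.append(CHOSEONG.charAt(cho));
            out.append(JUNGSEONG.charAt(jung));
            if (jong != 0) {
                out.append(JONGSEONG.charAt(jong));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decompose("한글"));     // ㅎㅏㄴㄱㅡㄹ
        System.out.println(decompose("된장찌개")); // ㄷㅚㄴㅈㅏㅇㅉㅣㄱㅐ
    }
}
```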

The reason for Jamo decomposition is spell checking, as explained above.

Also, a "Choseong Filter" is needed because many Koreans use *"Choseong Search"*
(especially on mobile devices).
Searching for "된장찌개" by typing every Jamo takes 10 Jamo, which is quite a lot.
For that reason, I think it would be useful to provide a filter that lets the same word be found by typing only "ㄷㅈㅉㄱ".
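
To make the Choseong Search idea concrete, here is a small sketch that extracts the initial consonants of a word and matches a Choseong-only query against them. The names `choseong` and `matches` are hypothetical, and both the prefix matching and the dropping of non-Hangul characters are assumptions made for the example, not behavior taken from the patch.

```java
final class ChoseongSearchSketch {
    // The 19 initial consonants in Unicode syllable-composition order.
    private static final String CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ";

    /** Returns the initial consonant of every Hangul syllable; other characters are dropped. */
    static String choseong(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0xAC00 && c <= 0xD7A3) {
                // Each initial consonant covers 21 * 28 = 588 consecutive syllables.
                out.append(CHOSEONG.charAt((c - 0xAC00) / 588));
            }
        }
        return out.toString();
    }

    /** True if the Choseong-only query is a prefix of the word's Choseong string. */
    static boolean matches(String word, String query) {
        return choseong(word).startsWith(query);
    }

    public static void main(String[] args) {
        System.out.println(choseong("된장찌개"));         // ㄷㅈㅉㄱ
        System.out.println(matches("된장찌개", "ㄷㅈ"));  // true
        System.out.println(matches("된장찌개", "ㄱㅈ"));  // false
    }
}
```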

Hangul also has *dual chars* (Jamo made up of two simpler Jamo), such as
"ㄲ, ㄸ, ㅃ, ㅆ, ㅉ, ㅚ (ㅗ + ㅣ), ㅢ (ㅡ + ㅣ), ...".

For these reasons,
KoreanDecomposeFilter offers *5 options*.

The examples below use the input *된장찌개*, which is tokenized as [된장], [찌개].

*1) ORIGIN*
[된장], [찌개]

*2) SINGLECHOSEONG*
[ㄷㅈ], [ㅉㄱ] 

*3) DUALCHOSEONG*
[ㄷㅈ], [ㅈㅈㄱ] 

*4) SINGLEJAMO*
[ㄷㅚㄴㅈㅏㅇ], [ㅉㅣㄱㅐ] 

*5) DUALJAMO*
[ㄷㅗㅣㄴㅈㅏㅇ], [ㅈㅈㅣㄱㅐ] 
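
The five options can be sketched as a single function over one token. This is only an illustration of the outputs listed above, not the filter in the patch: the enum, the method names, and the table of dual-char splits are assumptions, compound final consonants such as ㄳ are left unsplit, and non-Hangul characters are passed through unchanged.

```java
final class KoreanDecomposeSketch {
    enum Mode { ORIGIN, SINGLECHOSEONG, DUALCHOSEONG, SINGLEJAMO, DUALJAMO }

    private static final String CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ";
    private static final String JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ";
    // Index 0 means "no Jongseong"; the leading space is only a placeholder.
    private static final String JONGSEONG = " ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ";

    /** Assumed splits for dual chars; anything else is returned unchanged. */
    private static String split(char jamo) {
        switch (jamo) {
            case 'ㄲ': return "ㄱㄱ";
            case 'ㄸ': return "ㄷㄷ";
            case 'ㅃ': return "ㅂㅂ";
            case 'ㅆ': return "ㅅㅅ";
            case 'ㅉ': return "ㅈㅈ";
            case 'ㅘ': return "ㅗㅏ";
            case 'ㅙ': return "ㅗㅐ";
            case 'ㅚ': return "ㅗㅣ";
            case 'ㅝ': return "ㅜㅓ";
            case 'ㅞ': return "ㅜㅔ";
            case 'ㅟ': return "ㅜㅣ";
            case 'ㅢ': return "ㅡㅣ";
            default: return String.valueOf(jamo);
        }
    }

    private static void emit(StringBuilder out, char jamo, boolean splitDual) {
        if (splitDual) {
            out.append(split(jamo));
        } else {
            out.append(jamo);
        }
    }

    static String decompose(String token, Mode mode) {
        if (mode == Mode.ORIGIN) {
            return token;
        }
        boolean choseongOnly = mode == Mode.SINGLECHOSEONG || mode == Mode.DUALCHOSEONG;
        boolean splitDual = mode == Mode.DUALCHOSEONG || mode == Mode.DUALJAMO;
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c < 0xAC00 || c > 0xD7A3) {
                out.append(c);
                continue;
            }
            int index = c - 0xAC00;
            emit(out, CHOSEONG.charAt(index / 588), splitDual);
            if (choseongOnly) {
                continue;
            }
            emit(out, JUNGSEONG.charAt((index % 588) / 28), splitDual);
            if (index % 28 != 0) {
                emit(out, JONGSEONG.charAt(index % 28), splitDual);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode + ": [" + decompose("된장", mode) + "], ["
                    + decompose("찌개", mode) + "]");
        }
    }
}
```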

 


