lucene-dev mailing list archives

From "Erik Hatcher (JIRA)" <>
Subject [jira] Resolved: (LUCENE-461) StandardTokenizer splitting all of Korean words into separate characters
Date Sat, 12 Nov 2005 08:38:03 GMT
Erik Hatcher resolved LUCENE-461:

    Fix Version: 1.9
     Resolution: Fixed

These patches have been applied, thanks! 

One thing to note: the token type emitted for these characters has changed from "<CJK>"
to "<CJ>". Some code may rely on the old type, but this token type is brittle anyway,
since it is derived from the JavaCC grammar definition. I consider this an acceptable
break in backwards compatibility because it is unlikely that anyone is using that token
type.
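To illustrate the kind of code this could break, here is a minimal, self-contained sketch of downstream logic that filters tokens by their type string. It does not use the Lucene API itself; the term/type pairs and the type-string constants merely mirror what StandardTokenizer emits, so treat the names as assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class TokenTypeFilter {
    // Before this fix, StandardTokenizer emitted "<CJK>" for these
    // characters; after the fix it emits "<CJ>". Code hard-coding the
    // old string would silently stop matching.
    static final String CJ_TYPE = "<CJ>";

    // Each entry is a { termText, typeString } pair, standing in for a
    // Lucene token. Returns the terms whose type matches CJ_TYPE.
    static List<String> keepCjTokens(List<String[]> tokens) {
        List<String> kept = new ArrayList<>();
        for (String[] t : tokens) {
            if (CJ_TYPE.equals(t[1])) {
                kept.add(t[0]);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> tokens = new ArrayList<>();
        tokens.add(new String[] { "hello", "<ALPHANUM>" });
        tokens.add(new String[] { "\u4e2d", "<CJ>" });
        System.out.println(keepCjTokens(tokens));
    }
}
```

Code like this written against 1.4.x would need its constant updated from "<CJK>" to "<CJ>" when upgrading to 1.9, which is why the break is called out here.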

> StandardTokenizer splitting all of Korean words into separate characters
> ------------------------------------------------------------------------
>          Key: LUCENE-461
>          URL:
>      Project: Lucene - Java
>         Type: Bug
>   Components: Analysis
>  Environment: Analyzing Korean text with Apache Lucene, esp. with StandardAnalyzer.
>     Reporter: Cheolgoo Kang
>     Priority: Minor
>      Fix For: 1.9
>  Attachments: StandardTokenizer_KoreanWord.patch, TestStandardAnalyzer_KoreanWord.patch
> StandardTokenizer splits all those Korean words into separate character tokens. For
> example, "안녕하세요" is one Korean word that means "Hello", but StandardAnalyzer
> separates it into five tokens: "안", "녕", "하", "세", "요".

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators:
For more information on JIRA, see:

To unsubscribe, e-mail:
For additional commands, e-mail:
