lucene-java-user mailing list archives

From Tomoko Uchida <tomoko.uchida.1...@gmail.com>
Subject Re: Chinese sorting
Date Thu, 18 Dec 2014 18:16:04 GMT
Yes, sorting Kanji is not as easy as sorting Hiragana/Katakana.

We simply expect collators to sort strings phonetically, regardless of
the script they are written in (Hiragana, Katakana, or Kanji).
However, a Kanji character has multiple (usually two or three) readings. We
humans naturally judge which reading is suitable depending on the context.
That makes things difficult. Perhaps an ideal collator would have to judge
like a human.

Sorry for the long preamble.
I have tried ICUCollationKeyAnalyzer for Kanji and found it "not so bad": very
good compared to sorting by Unicode code point, but far from perfect.
I don't fully know the algorithm it uses, but the accuracy probably
depends heavily on the dictionaries/standards it is built on.
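
To illustrate the difference from code-point sorting, here is a minimal
sketch. It is an assumption-laden stand-in: it uses the JDK's own
java.text.Collator rather than ICU4J or ICUCollationKeyAnalyzer, and the
class name KanaSortSketch and the four sample kana are made up for the
example. Plain String ordering puts every Hiragana before every Katakana
because of their code-point ranges; a Japanese collator is meant to order
by sound instead, though its exact output depends on the collation data
shipped with the runtime, so none is claimed here.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class KanaSortSketch {

    // Sorts with String.compareTo, i.e. by UTF-16 code unit value.
    static List<String> byCodePoint(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        Collections.sort(sorted);
        return sorted;
    }

    // Sorts with the JDK collator for the given locale
    // (assumption: java.text.Collator stands in for ICU4J's collator).
    static List<String> byCollator(List<String> words, Locale locale) {
        List<String> sorted = new ArrayList<>(words);
        Collator collator = Collator.getInstance(locale);
        sorted.sort(collator);
        return sorted;
    }

    public static void main(String[] args) {
        // Hiragana KA (U+304B), Katakana A (U+30A2),
        // Hiragana A (U+3042), Katakana KA (U+30AB)
        List<String> words =
                Arrays.asList("\u304B", "\u30A2", "\u3042", "\u30AB");
        System.out.println("code point: " + byCodePoint(words));
        System.out.println("collator:   " + byCollator(words, Locale.JAPANESE));
    }
}
```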

(Just an FYI:) collators can take custom rules to adjust the ordering.
http://userguide.icu-project.org/collation/api
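
As a rough sketch of what such a rule looks like, the example below uses
the JDK's java.text.RuleBasedCollator as a stand-in (ICU4J has its own
com.ibm.icu.text.RuleBasedCollator that also accepts a rule string, but
its rule syntax and behaviour are described in the guide above and may
differ). The class name TailoredRulesSketch and the rule itself are
hypothetical: the rule simply says that 東 (U+6771) sorts before
京 (U+4EAC), which is the reverse of their code-point order.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, for illustration only.
final class TailoredRulesSketch {

    // Made-up rule: U+6771 sorts before U+4EAC.
    static final String RULES = "< \u6771 < \u4EAC";

    // Builds a collator from RULES alone (no locale defaults underneath).
    static RuleBasedCollator tailored() {
        try {
            return new RuleBasedCollator(RULES);
        } catch (ParseException e) {
            throw new IllegalStateException("invalid collation rules", e);
        }
    }

    // Returns a copy of the input sorted with the tailored collator.
    static List<String> sort(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(tailored());
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("\u4EAC", "\u6771");
        // Code-point order would keep U+4EAC first; the rule reverses that.
        System.out.println(sort(words));
    }
}
```

A real tailoring would more likely start from the locale's existing rules
and adjust only a few entries, rather than define an ordering from scratch.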

Regards,
Tomoko




2014-12-18 18:19 GMT+09:00 Nils Knappmeier <n.knappmeier@i-views.de>:
>
> Hi Tomoko,
>
> does sorting with Locale.JAPANESE also work for Kanji? Since Hiragana and
> Katakana are based on phonetics, I guess it is easier to define a
> sorting order for them. But Kanji is more similar to Chinese.
>
> Thanks,
>   Nils
>
>
> On 17.12.2014 17:01, Tomoko Uchida wrote:
>
>> Hi, Nils,
>>
>> I don't know Chinese at all... but collation is very important in Japanese
>> too.
>> Lucene has the org.apache.lucene.collation package, which uses ICU4J's
>> collators (you can find "lucene-analyzers-icu-4.10.2.jar" in the
>> analysis/icu directory).
>> http://lucene.apache.org/core/4_10_2/analyzers-icu/index.html?org/apache/lucene/collation/package-summary.html
>>
>> ICU4J also supports Chinese, of course.
>> http://site.icu-project.org/charts/collation-icu4j-sun
>>
>> I wrote a test program using ICUCollationKeyAnalyzer; it works well for
>> Japanese Hiragana/Katakana.
>> Here is a code snippet.
>>
>> // Collator here is com.ibm.icu.text.Collator (ICU4J), not java.text.Collator.
>> Analyzer collationAnalyzer = new ICUCollationKeyAnalyzer(
>>     Version.LUCENE_4_10_2, Collator.getInstance(Locale.JAPANESE));
>> IndexWriter writer = new IndexWriter(
>>     dir, new IndexWriterConfig(Version.LUCENE_4_10_2, collationAnalyzer));
>>
>> I understand collation is a very difficult problem, so I am not sure this
>> works for you...
>> I would appreciate it if you shared your trials/research.
>>
>> Regards,
>> Tomoko
>>
>> 2014-12-17 20:54 GMT+09:00 Nils Knappmeier <n.knappmeier@i-views.de>:
>>
>>> Hi,
>>>
>>> is there any implementation of a Chinese collator in Lucene? I've seen
>>> that there is a Chinese analyzer which uses Hidden Markov Models. But
>>> sorting seems to be an issue of its own, and all my googling hasn't led to
>>> any results yet.
>>>
>>> I understand that this is not a trivial issue, and I've read that the
>>> Chinese tend to prefer orderings other than by name, since the sorting
>>> orders are so complicated that nobody wants to use them. But we will have
>>> to sort search results by name, even when the name is Chinese (Simplified
>>> Chinese at the moment, but Traditional may also appear later), and
>>> currently Chinese words seem to be ordered by their Unicode code point,
>>> which does not seem to be the right order.
>>>
>>> Thanks in advance for any suggestion,
>>>   Nils
>>>
>>>
>
