lucene-java-user mailing list archives

From Tomoko Uchida <tomoko.uchida.1...@gmail.com>
Subject Re: Chinese sorting
Date Wed, 17 Dec 2014 16:01:37 GMT
Hi, Nils,

I don't know Chinese at all, but collation is very important in Japanese
too.
Lucene has the org.apache.lucene.collation package, which uses ICU4J's collators
(you can find "lucene-analyzers-icu-4.10.2.jar" in the analysis/icu directory).
http://lucene.apache.org/core/4_10_2/analyzers-icu/index.html?org/apache/lucene/collation/package-summary.html

ICU4J also supports Chinese, of course.
http://site.icu-project.org/charts/collation-icu4j-sun

I wrote a test program using ICUCollationKeyAnalyzer, and it works well for
Japanese Hiragana/Katakana.
Here is a code snippet:

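// Collator here is com.ibm.icu.text.Collator (ICU4J), not java.text.Collator.
Analyzer collationAnalyzer = new ICUCollationKeyAnalyzer(
    Version.LUCENE_4_10_2, Collator.getInstance(Locale.JAPANESE));
// dir is an already opened org.apache.lucene.store.Directory.
IndexWriter writer = new IndexWriter(dir,
    new IndexWriterConfig(Version.LUCENE_4_10_2, collationAnalyzer));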
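
To illustrate what a collation key changes, here is a minimal sketch that compares plain string order with collation-key order. It deliberately uses the JDK's java.text.Collator instead of ICU4J or Lucene, so that it runs without extra jars; that substitution is my assumption, and the JDK's rules for a given locale may differ from ICU4J's. The class and method names (CollationSketch, sortByCodeUnit, sortByCollationKey) are made up for illustration. The sample names are the surnames Wang, Chen, Zhang and Li, written as Unicode escapes.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Lucene or ICU4J.
class CollationSketch {

    // Plain String order: compares UTF-16 code units. For characters in the
    // Basic Multilingual Plane this is the same as Unicode code point order.
    static List<String> sortByCodeUnit(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    // Locale-aware order: each name is converted once into a CollationKey,
    // and the keys (not the raw strings) are compared.
    static List<String> sortByCollationKey(List<String> names, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<CollationKey> keys = new ArrayList<>();
        for (String name : names) {
            keys.add(collator.getCollationKey(name));
        }
        keys.sort(Comparator.naturalOrder());
        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        // Wang (U+738B), Chen (U+9648), Zhang (U+5F20), Li (U+674E)
        List<String> names = List.of("\u738B", "\u9648", "\u5F20", "\u674E");
        System.out.println("code unit order: " + sortByCodeUnit(names));
        System.out.println("collated (zh):   "
                + sortByCollationKey(names, Locale.SIMPLIFIED_CHINESE));
    }
}
```

The code unit order of these four names is Zhang, Li, Wang, Chen, whereas an alphabetical pinyin order would be Chen, Li, Wang, Zhang. What the collated line prints depends on the collation rules shipped with your JDK, so please check it against the order you actually need.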

I understand collation is a very difficult problem, so I am not sure whether this
works for you...
I would appreciate it if you shared the results of your trials/research.

Regards,
Tomoko

2014-12-17 20:54 GMT+09:00 Nils Knappmeier <n.knappmeier@i-views.de>:
>
> Hi,
>
> Is there any implementation of a Chinese collator in Lucene? I've seen
> that there is a Chinese analyzer which uses Hidden Markov Models. But
> sorting seems to be an issue of its own, and all my googling hasn't led to
> any results yet.
>
> I understand that this is not a trivial issue, and I've read that the
> Chinese tend to prefer orderings other than by name, since sorting orders
> are so complicated that nobody wants to use them. But we will have to sort
> search results by name, even though the name is Chinese (Simplified Chinese
> at the moment, but Traditional may also appear later), and currently Chinese
> words seem to be ordered by their Unicode code point, which does not seem
> to be the right order.
>
> Thanks in advance for any suggestion,
>  Nils
>
