lucene-java-user mailing list archives

From 김남규 <kng0...@gmail.com>
Subject Re: JapaneseAnalyzer's system vs user dict
Date Sat, 25 May 2019 20:08:12 GMT
Hi, Mike :D

JapaneseAnalyzer does not load a user dictionary by default.
If you look at the constructor, you can see that the user dictionary is
null unless you pass one in as a parameter.
(See testUserDict3() in TestJapaneseAnalyzer.java.)

In JapaneseTokenizer,
=============================================
if (userDictionary != null) {
  userFST = userDictionary.getFST();
  userFSTReader = userFST.getBytesReader();
} else {
  userFST = null;
  userFSTReader = null;
}
=============================================
Because the user dictionary is created as a separate UserDictionary object
and passed in, it does not conflict with the system dictionary.
(You can set only one user dictionary; it becomes the userFST instance in
JapaneseTokenizer.)
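
To make the null check above more concrete, here is a small toy model. It
is not Lucene code: plain maps stand in for the FST-backed dictionaries,
and the class name DictLookupSketch and the tag strings are made up for
illustration. It also assumes that a user entry is tried before the system
entry, which is my reading of the tokenizer and not something the snippet
above proves.

```java
import java.util.Map;

// Toy model only: plain maps stand in for Lucene's FST-backed dictionaries.
final class DictLookupSketch {
    private final Map<String, String> systemDict; // always present
    private final Map<String, String> userDict;   // may be null, like userDictionary

    DictLookupSketch(Map<String, String> systemDict, Map<String, String> userDict) {
        this.systemDict = systemDict;
        this.userDict = userDict;
    }

    // Returns the part-of-speech tag for a surface form, or null if unknown.
    String lookup(String surface) {
        // Same kind of null check as in the tokenizer snippet:
        // skip the user dictionary entirely if none was set.
        if (userDict != null) {
            String pos = userDict.get(surface);
            if (pos != null) {
                return pos; // assumption: a user entry is preferred over a system entry
            }
        }
        return systemDict.get(surface);
    }

    public static void main(String[] args) {
        Map<String, String> system = Map.of("tokyo", "noun-proper", "kuukou", "noun");
        Map<String, String> user = Map.of("tokyo", "custom-noun");
        System.out.println(new DictLookupSketch(system, null).lookup("tokyo"));  // noun-proper
        System.out.println(new DictLookupSketch(system, user).lookup("tokyo"));  // custom-noun
        System.out.println(new DictLookupSketch(system, user).lookup("kuukou")); // noun
    }
}
```

The point is only that the system dictionary is always there, and the user
dictionary is an optional extra object that is consulted when it is not null.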

About the dictionary:
Lucene has shipped one pre-built Japanese dictionary by default since
Lucene 3.6.
You can find it in org.apache.lucene.analysis.ja.dict.
It is built from a MeCab dictionary; MeCab is a morphological analyzer that
uses the Viterbi algorithm.
Lucene converts the MeCab dictionary to an FST (stored in Lucene as several
.dat files) and uses that.
But it can't satisfy all users.
Depending on the situation, some users may need a custom dictionary.
The same applies to Nori (the Korean analyzer), available since Lucene 7.4.
(The basic logic, MeCab + FST, is similar to the Japanese analyzer.)
The original Korean MeCab dictionary is almost 220MB, but Lucene's
dictionary is 24MB.
If a user needs a 100MB dictionary, they must build it themselves and use
it.
(Modify the MeCab dictionary -> train -> port to Lucene)
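
To show why a custom dictionary entry can change the result, here is a
rough sketch of a Viterbi-style search for the lowest-cost segmentation.
It is a simplification and not the MeCab or Lucene implementation: it uses
word costs only, while the real dictionaries also use connection costs
between adjacent entries and handle unknown words. The class name
ViterbiSketch and all costs are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy lattice search: word costs only, no connection costs, no unknown-word handling.
final class ViterbiSketch {
    // Returns the lowest-cost segmentation of text, or an empty list if none exists.
    static List<String> segment(String text, Map<String, Integer> wordCosts) {
        int n = text.length();
        long[] best = new long[n + 1]; // best[i] = minimum cost to segment text[0, i)
        int[] prev = new int[n + 1];   // prev[i] = start of the last word on the best path to i
        Arrays.fill(best, Long.MAX_VALUE);
        best[0] = 0;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (best[start] == Long.MAX_VALUE) continue; // position not reachable
                Integer cost = wordCosts.get(text.substring(start, end));
                if (cost != null && best[start] + cost < best[end]) {
                    best[end] = best[start] + cost;
                    prev[end] = start;
                }
            }
        }
        if (best[n] == Long.MAX_VALUE) return Collections.emptyList();
        // Walk the back-pointers from the end to recover the words.
        List<String> words = new ArrayList<>();
        for (int i = n; i > 0; i = prev[i]) words.add(text.substring(prev[i], i));
        Collections.reverse(words);
        return words;
    }

    public static void main(String[] args) {
        Map<String, Integer> costs = new HashMap<>(Map.of("a", 1, "b", 1, "c", 2, "ab", 5));
        System.out.println(segment("abc", costs)); // [a, b, c]
        costs.put("abc", 1); // a cheap custom entry for the whole string
        System.out.println(segment("abc", costs)); // [abc]
    }
}
```

With the base costs the cheapest path splits the text into three words;
after adding one cheap entry for the whole string, the search picks that
entry instead. This is the effect a custom dictionary entry is meant to have.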

If anyone finds any wrong information in my reply, please reply with the
correction.

Thank you,
Namgyu Kim


On Sun, May 26, 2019 at 4:03 AM, Michael Sokolov <msokolov@gmail.com> wrote:

> I'm trying to understand the relationship between the system and user
> dictionaries that JapaneseAnalyzer uses. The API allows a user to
> provide a user dictionary; the system one is built in. Are they
> otherwise the same kind of thing? If I provide entries in the user
> dictionary is it just as if I had included them in the system
> dictionary? If the same entry occurs in both, do the user dictionary
> weights supersede those in the system dictionary? Is there some way to
> suppress entries in the system dict?  I hunted for documentation, but
> didn't find answers to these questions, and the code is pretty
> involved, so any pointers would be greatly appreciated.
>
> -Mike
