lucene-java-user mailing list archives

From Tomoko Uchida <tomoko.uchida.1...@gmail.com>
Subject Re: JapaneseAnalyzer's system vs user dict
Date Sat, 25 May 2019 23:30:14 GMT
Hi,

> If I provide entries in the user
> dictionary is it just as if I had included them in the system
> dictionary? If the same entry occurs in both, do the user dictionary
> weights supersede those in the system dictionary? Is there some way to
> suppress entries in the system dict?

The user dictionary is independent of the system dictionary. If you
supply user entries, JapaneseTokenizer builds two FSTs, one for the
built-in dictionary and one for the user dictionary, and looks them up
separately.

The user dictionary is looked up first, and the system dictionary is
consulted only if no user entry matches. So if any entry is found in
the user dictionary, all possible candidates from the system
dictionary are ignored (suppressed).
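
To make the lookup order concrete, here is a minimal sketch of it. This is
a simplified model, not Lucene's actual code: the class name, the method
names, and the Set-based "dictionaries" are hypothetical stand-ins for the
two FSTs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified model of the user-first lookup order; NOT Lucene's actual code.
// UserFirstLookup, candidatesAt and prefixMatches are hypothetical names,
// and plain Sets stand in for the user and system FSTs.
final class UserFirstLookup {

    /**
     * Returns the dictionary entries that start at position pos of text.
     * The user dictionary is consulted first; the system dictionary is
     * consulted only if the user dictionary produced no match at all.
     */
    static List<String> candidatesAt(String text, int pos,
                                     Set<String> userDict,
                                     Set<String> systemDict) {
        List<String> found = prefixMatches(text, pos, userDict);
        if (found.isEmpty()) {
            // No user entry matched here, so fall back to the system dictionary.
            found = prefixMatches(text, pos, systemDict);
        }
        return found;
    }

    // Collects every entry that is a prefix of text starting at pos,
    // shortest match first.
    private static List<String> prefixMatches(String text, int pos, Set<String> dict) {
        List<String> matches = new ArrayList<>();
        for (int end = pos + 1; end <= text.length(); end++) {
            String surface = text.substring(pos, end);
            if (dict.contains(surface)) {
                matches.add(surface);
            }
        }
        return matches;
    }
}
```

In this model, a user entry that matches at a position hides every system
entry starting at that position, even ones with a different length.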

(I think this behaviour is specific to kuromoji. The original MeCab POS
tagger looks up both the system dictionary and the user dictionary and
compares their weights by running Viterbi. In fact, always giving
priority to the entries in the user dictionary is a bit too aggressive
from the point of view of tokenization consistency. I do not know why
it was done this way, but there may be some performance reasons.)
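
For contrast, the following sketch illustrates the cost-based alternative
attributed to MeCab above: candidates from both dictionaries compete, and
the cheapest path wins. It is a deliberately reduced model. It uses word
costs only and omits the connection costs that a real Viterbi lattice also
scores, and every name in it (MergedCostLookup, segment, the cost values)
is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Reduced model of cost-based selection across both dictionaries.
// NOT MeCab's or Lucene's code: connection costs are omitted and all
// names are hypothetical.
final class MergedCostLookup {

    /**
     * Segments text into dictionary entries with the lowest total word cost,
     * drawing candidates from both dictionaries. Returns an empty list if
     * the text cannot be fully covered by known entries.
     */
    static List<String> segment(String text,
                                Map<String, Integer> userDict,
                                Map<String, Integer> systemDict) {
        // Both dictionaries contribute candidates; if a surface form occurs
        // in both, the cheaper entry is the one that can win.
        Map<String, Integer> merged = new HashMap<>(systemDict);
        userDict.forEach((surface, cost) -> merged.merge(surface, cost, Math::min));

        int n = text.length();
        int[] best = new int[n + 1];   // best[i]: lowest cost to cover text[0, i)
        int[] prev = new int[n + 1];   // prev[i]: start of the last word on that path
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;

        for (int start = 0; start < n; start++) {
            if (best[start] == Integer.MAX_VALUE) {
                continue; // position not reachable
            }
            for (int end = start + 1; end <= n; end++) {
                Integer cost = merged.get(text.substring(start, end));
                if (cost != null && best[start] + cost < best[end]) {
                    best[end] = best[start] + cost;
                    prev[end] = start;
                }
            }
        }

        List<String> words = new ArrayList<>();
        if (best[n] == Integer.MAX_VALUE) {
            return words; // no full segmentation exists
        }
        // Walk the back-pointers from the end, then restore left-to-right order.
        for (int end = n; end > 0; end = prev[end]) {
            words.add(text.substring(prev[end], end));
        }
        Collections.reverse(words);
        return words;
    }
}
```

Under this scheme a user entry is chosen only when its cost makes the
overall path cheaper, so an expensive user entry does not override a
cheaper segmentation built from system entries.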

I think you can easily find the lookup logic I described here in the
JapaneseTokenizer#parse() method. (Let me know if my understanding is
not correct.)

Regards,
Tomoko

On Sun, May 26, 2019 at 5:08, 김남규 <kng0828@gmail.com> wrote:
>
> Hi, Mike :D
>
> JapaneseAnalyzer does not load a user dictionary by default.
> If you look at the constructor, you can see that the user dictionary is
> null if no parameters are given.
> (See testUserDict3() in TestJapaneseAnalyzer.java.)
>
> In JapaneseTokenizer,
> =============================================
> if (userDictionary != null) {
>   userFST = userDictionary.getFST();
>   userFSTReader = userFST.getBytesReader();
> } else {
>   userFST = null;
>   userFSTReader = null;
> }
> =============================================
> Because the user dictionary is created and passed in as a separate
> UserDictionary object, there is no conflict between the user dictionary
> and the system dictionary.
> (Only one of them is chosen; this refers to the userFST instance in
> JapaneseTokenizer.)
>
> About the dictionary:
> Lucene has shipped one pre-built dictionary by default since Lucene 3.6.
> You can find it in org.apache.lucene.analysis.ja.dict.
> It comes from MeCab, which uses the Viterbi algorithm.
> Lucene converts the MeCab dictionary (stored as .dat files in Lucene) to
> an FST and uses that.
> But it can't satisfy all users.
> Depending on the situation, some users may need a custom dictionary.
> The same has been true of Nori (the Korean analyzer) since Lucene 7.4.
> (The basic logic, MeCab + FST, is similar to the Japanese analyzer's.)
> The original Korean MeCab dictionary is almost 220MB, but Lucene's
> dictionary is 24MB.
> If a user needs a 100MB dictionary, the user must build it and use it
> themselves.
> (Modify MeCab dictionary -> training -> porting to Lucene)
>
> If anyone finds wrong information in my reply, please reply with the
> correction.
>
> Thank you,
> Namgyu Kim
>
>
> On Sun, May 26, 2019 at 4:03 AM, Michael Sokolov <msokolov@gmail.com> wrote:
>
> > I'm trying to understand the relationship between the system and user
> > dictionaries that JapaneseAnalyzer uses. The API allows a user to
> > provide a user dictionary; the system one is built in. Are they
> > otherwise the same kind of thing? If I provide entries in the user
> > dictionary is it just as if I had included them in the system
> > dictionary? If the same entry occurs in both, do the user dictionary
> > weights supersede those in the system dictionary? Is there some way to
> > suppress entries in the system dict?  I hunted for documentation, but
> > didn't find answers to these questions, and the code is pretty
> > involved, so any pointers would be greatly appreciated.
> >
> > -Mike
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
