lucene-java-user mailing list archives

From Michael Sokolov <msoko...@gmail.com>
Subject Re: JapaneseAnalyzer's system vs user dict
Date Sun, 26 May 2019 00:02:56 GMT
Thank you for the detailed responses! What Tomoko is saying seems
consistent with my cursory reading of the code. The reason I asked is
I have a customer that thinks they want to replace the system
dictionary, and I am trying to see if that is necessary. It seems as
if for the most part, we can supply a comprehensive user dictionary
and it would pretty much take the place of the system dictionary,
assuming it is a superset (covers at least the original system dict
tokens), but there is probably no way to "remove" a token that is
present in the system dictionary (or maybe it can effectively be
removed by adding it to user dictionary with a high penalty?). I'm not
sure why one would want to do this removal, just trying to understand
the design parameters.

On Sat, May 25, 2019 at 7:30 PM Tomoko Uchida
<tomoko.uchida.1111@gmail.com> wrote:
>
> Hi,
>
> > If I provide entries in the user
> > dictionary is it just as if I had included them in the system
> > dictionary? If the same entry occurs in both, do the user dictionary
> > weights supersede those in the system dictionary? Is there some way to
> > suppress entries in the system dict?
>
> The user dictionary is independent of the system dictionary. If you
> supply user entries, JapaneseTokenizer builds two FSTs, one for the
> built-in dictionary and one for the user dictionary, and they are
> looked up separately.
>
> The user dictionary is looked up first, and the system dictionary is
> looked up only if no user entry matches. So if any entry is found in
> the user dictionary at a given position, all possible candidates from
> the system dictionary at that position are ignored (suppressed).
>
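The lookup order described above can be modelled with a small toy example. This is only a sketch of the idea, not Lucene's code: the class `UserFirstLookup`, its methods, and the romanized words (standing in for Japanese surface forms) are hypothetical, and plain sets replace the FSTs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model of "user dictionary first, system dictionary as fallback".
// Hypothetical names; plain sets stand in for the two FSTs.
class UserFirstLookup {

    // Every dictionary word that occurs in text starting exactly at pos.
    static List<String> prefixMatches(Set<String> dict, String text, int pos) {
        List<String> matches = new ArrayList<>();
        for (String word : dict) {
            if (!word.isEmpty() && text.startsWith(word, pos)) {
                matches.add(word);
            }
        }
        return matches;
    }

    // If the user dictionary matches anything at pos, the system
    // dictionary is not consulted at all for that position.
    static List<String> candidatesAt(Set<String> userDict, Set<String> systemDict,
                                     String text, int pos) {
        List<String> userMatches = prefixMatches(userDict, text, pos);
        if (!userMatches.isEmpty()) {
            return userMatches;
        }
        return prefixMatches(systemDict, text, pos);
    }

    public static void main(String[] args) {
        Set<String> system = new TreeSet<>(List.of("kansai", "kansaikokusaikuukou", "kokusai"));
        Set<String> user = new TreeSet<>(List.of("kansai"));
        String text = "kansaikokusaikuukou";
        System.out.println(candidatesAt(user, system, text, 0)); // [kansai]
        System.out.println(candidatesAt(user, system, text, 6)); // [kokusai]
    }
}
```

In this model the longer system entry is never offered at position 0, because the user entry for its prefix matched first. At position 6 the user dictionary has no match, so the system entry is used.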
> (I think this behaviour is specific to Kuromoji. The original MeCab
> POS tagger looks up both the system dictionary and the user dictionary
> and compares their weights by performing Viterbi. In fact, always
> giving priority to the entries in the user dictionary is a bit too
> aggressive from the point of view of tokenization consistency. I do
> not know why it was done this way; perhaps there are performance
> reasons?)
>
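The difference between the two strategies can be seen in a toy segmenter. This is a simplified sketch under stated assumptions, not MeCab's or Lucene's implementation: only word costs are used (no connection costs or unknown-word handling), and the class `DictionaryStrategySketch`, its methods, the romanized words and the cost values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy contrast: user-first lookup vs. merging both dictionaries and
// letting a Viterbi-style lowest-cost search decide. Word costs only.
class DictionaryStrategySketch {

    // Dictionary entries (word -> cost) that occur in text starting at pos.
    static Map<String, Integer> prefixMatches(Map<String, Integer> dict, String text, int pos) {
        Map<String, Integer> matches = new HashMap<>();
        dict.forEach((word, cost) -> {
            if (!word.isEmpty() && text.startsWith(word, pos)) {
                matches.put(word, cost);
            }
        });
        return matches;
    }

    static Map<String, Integer> candidatesAt(Map<String, Integer> user, Map<String, Integer> system,
                                             String text, int pos, boolean userFirst) {
        Map<String, Integer> userMatches = prefixMatches(user, text, pos);
        if (userFirst && !userMatches.isEmpty()) {
            return userMatches; // system candidates are suppressed here
        }
        Map<String, Integer> result = prefixMatches(system, text, pos);
        if (!userFirst) {
            // Merged mode: both sources compete, the cheaper entry wins.
            userMatches.forEach((word, cost) -> result.merge(word, cost, Integer::min));
        }
        return result;
    }

    // Lowest-total-cost segmentation; empty list if none exists.
    static List<String> segment(Map<String, Integer> user, Map<String, Integer> system,
                                String text, boolean userFirst) {
        int n = text.length();
        long[] best = new long[n + 1]; // best[i]: cheapest cost to reach offset i
        int[] back = new int[n + 1];   // back[i]: start offset of the last token
        Arrays.fill(best, Long.MAX_VALUE);
        best[0] = 0;
        for (int start = 0; start < n; start++) {
            if (best[start] == Long.MAX_VALUE) continue; // offset not reachable
            for (Map.Entry<String, Integer> e
                    : candidatesAt(user, system, text, start, userFirst).entrySet()) {
                int end = start + e.getKey().length();
                long total = best[start] + e.getValue();
                if (total < best[end]) {
                    best[end] = total;
                    back[end] = start;
                }
            }
        }
        List<String> tokens = new ArrayList<>();
        if (best[n] == Long.MAX_VALUE) return tokens;
        for (int end = n; end > 0; end = back[end]) {
            tokens.add(0, text.substring(back[end], end));
        }
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, Integer> system = Map.of(
                "kansai", 100, "kokusai", 100, "kuukou", 100, "kansaikokusaikuukou", 150);
        Map<String, Integer> user = Map.of("kansai", 200);
        String text = "kansaikokusaikuukou";
        System.out.println(segment(user, system, text, true));  // [kansai, kokusai, kuukou]
        System.out.println(segment(user, system, text, false)); // [kansaikokusaikuukou]
    }
}
```

With user-first lookup the user entry wins at position 0 regardless of its cost, so the cheaper single-token system entry is never considered. With merged candidates the same input yields the single token. That is the consistency concern raised above.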
> I think you can easily find the lookup logic I described here in the
> JapaneseTokenizer#parse() method. (Let me know if my understanding is
> not correct.)
>
> Regards,
> Tomoko
>
> On Sun, May 26, 2019 at 5:08, 김남규 <kng0828@gmail.com> wrote:
> >
> > Hi, Mike :D
> >
> > JapaneseAnalyzer does not load a user dictionary by default.
> > If you look at the constructor, you can see that the user dictionary is
> > null unless you pass one in as a parameter.
> > (See testUserDict3() in TestJapaneseAnalyzer.java.)
> >
> > In JapaneseTokenizer,
> > =============================================
> > if (userDictionary != null) {
> >   userFST = userDictionary.getFST();
> >   userFSTReader = userFST.getBytesReader();
> > } else {
> >   userFST = null;
> >   userFSTReader = null;
> > }
> > =============================================
> > Because the user dictionary is created and passed in as a separate
> > UserDictionary object, there is no conflict between the user dictionary
> > and the system dictionary.
> > (You may choose only one of them! That is, the userFST instance in
> > JapaneseTokenizer.)
> >
> > About the dictionary:
> > Lucene has shipped one pre-built dictionary by default since Lucene 3.6.
> > You can find it in org.apache.lucene.analysis.ja.dict.
> > It is a MeCab dictionary; MeCab uses the Viterbi algorithm.
> > Lucene converts the MeCab dictionary to an FST (stored in Lucene as some
> > dat files) and uses that.
> > But it cannot satisfy all users.
> > Depending on the situation, some users may need a custom dictionary.
> > The same is true for Nori (the Korean analyzer) since Lucene 7.4. (The
> > basic logic, MeCab + FST, is similar to the Japanese analyzer.)
> > The original Korean MeCab dictionary is almost 220MB, but Lucene's
> > dictionary is 24MB.
> > A user who needs a 100MB dictionary must build it and use it
> > themselves.
> > (Modify the MeCab dictionary -> train -> port to Lucene)
> >
> > If anyone finds wrong information in my reply, please reply with
> > the correction.
> >
> > Thank you,
> > Namgyu Kim
> >
> >
> > On Sun, May 26, 2019 at 4:03 AM, Michael Sokolov <msokolov@gmail.com> wrote:
> >
> > > I'm trying to understand the relationship between the system and user
> > > dictionaries that JapaneseAnalyzer uses. The API allows a user to
> > > provide a user dictionary; the system one is built in. Are they
> > > otherwise the same kind of thing? If I provide entries in the user
> > > dictionary is it just as if I had included them in the system
> > > dictionary? If the same entry occurs in both, do the user dictionary
> > > weights supersede those in the system dictionary? Is there some way to
> > > suppress entries in the system dict?  I hunted for documentation, but
> > > didn't find answers to these questions, and the code is pretty
> > > involved, so any pointers would be greatly appreciated.
> > >
> > > -Mike
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
>
>


