lucene-java-user mailing list archives

From Namgyu Kim <kng0...@gmail.com>
Subject Re: JapaneseAnalyzer's system vs user dict
Date Sun, 26 May 2019 11:56:33 GMT
Sorry for the wrong information, Mike.
Tomoko is right.
I misread the code.

The user dictionary is independent of the system dictionary. If you
supply user entries, JapaneseTokenizer builds two FSTs, one for the
built-in dictionary and one for the user dictionary, and they are
retrieved separately.
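
To make the lookup order concrete (Tomoko's description, quoted further down, says the user dictionary is consulted first and the system dictionary only when nothing matched), here is a toy model in plain Java. It is only a sketch and not Lucene code: the class DictionaryLookupSketch, its methods, and the romanized sample entries are all made up for illustration, and plain sets stand in for the two FSTs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/**
 * Toy model of the lookup order described in this thread.
 * NOT Lucene code: every name and sample entry here is hypothetical.
 */
final class DictionaryLookupSketch {

    /** Returns every dictionary entry that matches the text starting at pos, shortest first. */
    static List<String> prefixMatches(Set<String> dict, String text, int pos) {
        List<String> matches = new ArrayList<>();
        for (int end = pos + 1; end <= text.length(); end++) {
            String candidate = text.substring(pos, end);
            if (dict.contains(candidate)) {
                matches.add(candidate);
            }
        }
        return matches;
    }

    /**
     * Looks in the user dictionary first. The system dictionary is consulted
     * only when the user dictionary has no match at this position.
     */
    static List<String> candidatesAt(Set<String> userDict, Set<String> systemDict,
                                     String text, int pos) {
        List<String> userMatches = prefixMatches(userDict, text, pos);
        if (!userMatches.isEmpty()) {
            return userMatches; // system candidates are suppressed
        }
        return prefixMatches(systemDict, text, pos);
    }

    public static void main(String[] args) {
        Set<String> systemDict = Set.of("to", "tokyo", "tower");
        Set<String> userDict = Set.of("tokyotower");
        String text = "tokyotower";

        System.out.println(candidatesAt(userDict, systemDict, text, 0));
        System.out.println(candidatesAt(Set.of(), systemDict, text, 0));
        System.out.println(candidatesAt(userDict, systemDict, text, 5));
    }
}
```

With a user dictionary containing "tokyotower", the lookup at position 0 returns only that entry, and the system candidates "to" and "tokyo" are never considered. At position 5 the user dictionary has no match, so the system entry "tower" is returned.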

Please ignore the following lines from my earlier e-mail.
================================================
Japanese Analyzer does not load dictionaries by default.
...
Since it is a way to create and pass the UserDictionary object, there is no
conflict between user dictionary and system dictionary.
(You may choose only one of them! -> means userFST instance in
JapaneseTokenizer)
=================================================

The system dictionary and the user dictionary are separate, and you can
use both at the same time.

About the system dictionary:
As far as I know, it is not possible to replace the system dictionary at
the code level.
The part that reads the system dictionary is hard-coded
(TokenInfoDictionary, UnknownDictionary, BinaryDictionary).
If you really need this, could you open a JIRA issue so we can work on it
together?

But there is a way to build a new kuromoji jar with a modified dictionary.
1. Modify your dictionary file and rebuild.
  1-1) Install MeCab
  1-2) Install the MeCab dictionary
  1-3) Modify your dictionary file
  1-4) Pack it into a tar.gz archive
2. Change kuromoji/ivy.xml from
<artifact name="ipadic" type=".tar.gz" url="https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz"/>
to
<artifact name="ipadic" type=".tar.gz" url="file:///your/tar path/new_dic.tar.gz"/>
3. Run "ant regenerate" in /your/path/lucene-solr/lucene/analysis/kuromoji
4. Run "ant jar"
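
For step 2, the url attribute has to be a file URL pointing at the new archive. If the real path contains spaces or other special characters, one way to get a correctly escaped URL is to let Java build it. This is a small sketch that assumes nothing beyond the JDK. The class name IvyFileUrlSketch and the sample path are placeholders, and this thread does not confirm whether Ivy requires the escaped form.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper, not part of Lucene or Ivy. */
final class IvyFileUrlSketch {

    /** Converts a local archive path into an absolute, percent-encoded file: URL string. */
    static String toFileUrl(String archivePath) {
        Path path = Paths.get(archivePath).toAbsolutePath().normalize();
        // Path.toUri() escapes characters such as spaces (a space becomes %20).
        return path.toUri().toString();
    }

    public static void main(String[] args) {
        // Placeholder path taken from step 2; substitute the real location of the archive.
        System.out.println(toFileUrl("/your/tar path/new_dic.tar.gz"));
    }
}
```

The printed value can be pasted into the url attribute in kuromoji/ivy.xml.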

I hope this helps.

Warm regards,
Namgyu Kim

On Sun, May 26, 2019 at 9:03 AM, Michael Sokolov <msokolov@gmail.com> wrote:

> Thank you for the detailed responses! What Tomoko is saying seems
> consistent with my cursory reading of the code. The reason I asked is
> I have a customer that thinks they want to replace the system
> dictionary, and I am trying to see if that is necessary. It seems as
> if for the most part, we can supply a comprehensive user dictionary
> and it would pretty much take the place of the system dictionary,
> assuming it is a superset (covers at least the original system dict
> tokens), but there is probably no way to "remove" a token that is
> present in the system dictionary (or maybe it can effectively be
> removed by adding it to user dictionary with a high penalty?). I'm not
> sure why one would want to do this removal, just trying to understand
> the design parameters.
>
> On Sat, May 25, 2019 at 7:30 PM Tomoko Uchida
> <tomoko.uchida.1111@gmail.com> wrote:
> >
> > Hi,
> >
> > > If I provide entries in the user
> > > dictionary is it just as if I had included them in the system
> > > dictionary? If the same entry occurs in both, do the user dictionary
> > > weights supersede those in the system dictionary? Is there some way to
> > > suppress entries in the system dict?
> >
> > The user dictionary is independent of the system dictionary. If you
> > supply user entries, JapaneseTokenizer builds two FSTs, one for the
> > built-in dictionary and one for the user dictionary, and they are
> > retrieved separately.
> >
> > First the user dictionary is retrieved, and if there are no entries
> > matched then the system dictionary is retrieved. So if any entry is
> > found in the user dictionary, all possible candidates in the system
> > dictionary are ignored (suppressed).
> >
> > (I think this is kuromoji-specific behaviour. The original MeCab POS
> > tagger looks up both the system dictionary and the user dictionary and
> > compares their weights by running Viterbi. In fact, always giving
> > priority to the entries in the user dictionary is a bit too aggressive
> > from the point of view of tokenization consistency. I do not know why
> > it was done this way, but there may be some performance reasons?)
> >
> > I think you can easily find the retrieval logic I described here in
> > JapaneseTokenizer#parse() method. (Let me know if my understanding is
> > not correct.)
> >
> > Regards,
> > Tomoko
> >
> > On Sun, May 26, 2019 at 5:08 AM, 김남규 <kng0828@gmail.com> wrote:
> > >
> > > Hi, Mike :D
> > >
> > > Japanese Analyzer does not load dictionaries by default.
> > > If you look at the constructor, you can see that it is created as null if
> > > no parameters are set.
> > > (check testUserDict3() in TestJapaneseAnalyzer.java)
> > >
> > > In JapaneseTokenizer,
> > > =============================================
> > > if (userDictionary != null) {
> > >   userFST = userDictionary.getFST();
> > >   userFSTReader = userFST.getBytesReader();
> > > } else {
> > >   userFST = null;
> > >   userFSTReader = null;
> > > }
> > > =============================================
> > > Since it is a way to create and pass the UserDictionary object, there is no
> > > conflict between user dictionary and system dictionary.
> > > (You may choose only one of them! -> means userFST instance in
> > > JapaneseTokenizer)
> > >
> > > About the dictionary:
> > > Lucene has shipped one pre-built dictionary by default since Lucene 3.6.
> > > You can find it in org.apache.lucene.analysis.ja.dict.
> > > It is based on MeCab, which uses the Viterbi algorithm.
> > > Lucene converts the MeCab dictionary (stored in Lucene as some .dat files)
> > > to an FST and uses that.
> > > But it can't satisfy all users.
> > > Depending on the situation, some users may need a custom dictionary.
> > > The same is true for Nori (the Korean analyzer), available since Lucene 7.4.
> > > (The basic logic (MeCab + FST) is similar to the Japanese analyzer.)
> > > The original Korean MeCab dictionary is almost 220MB, but Lucene's
> > > dictionary is 24MB.
> > > If a user needs a 100MB dictionary, the user must build it and use it
> > > themselves.
> > > (Modify MeCab dictionary -> Training -> Porting to Lucene)
> > >
> > > If anyone finds wrong information in my reply, please send a reply with
> > > the correction.
> > >
> > > Thank you,
> > > Namgyu Kim
> > >
> > >
> > > On Sun, May 26, 2019 at 4:03 AM, Michael Sokolov <msokolov@gmail.com> wrote:
> > >
> > > > I'm trying to understand the relationship between the system and user
> > > > dictionaries that JapaneseAnalyzer uses. The API allows a user to
> > > > provide a user dictionary; the system one is built in. Are they
> > > > otherwise the same kind of thing? If I provide entries in the user
> > > > dictionary is it just as if I had included them in the system
> > > > dictionary? If the same entry occurs in both, do the user dictionary
> > > > weights supersede those in the system dictionary? Is there some way to
> > > > suppress entries in the system dict?  I hunted for documentation, but
> > > > didn't find answers to these questions, and the code is pretty
> > > > involved, so any pointers would be greatly appreciated.
> > > >
> > > > -Mike
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > For additional commands, e-mail: java-user-help@lucene.apache.org
