lucene-java-user mailing list archives

From Tomoko Uchida <tomoko.uchida.1...@gmail.com>
Subject Re: JapaneseAnalyzer's system vs user dict
Date Tue, 28 May 2019 13:43:14 GMT
Hi guys,

I just created an issue related to this thread.

Decouple Kuromoji's morphological analyser and its dictionary
https://issues.apache.org/jira/browse/LUCENE-8816

The problem discussed here essentially lies in the current
architecture of Kuromoji (and Nori): the system dictionary is bundled
in the jar.
So the most natural solution is to decouple the Viterbi logic from the
encoded dictionary (just as traditional Japanese morphological
analysis engines do).
This is actually an old question with respect to Kuromoji, but I feel
it's a good time to rethink it.
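
To make the idea concrete, here is a minimal sketch of what a decoupled
design could look like. It is not the LUCENE-8816 patch and not an
existing Lucene or Kuromoji API: the names Entry, Dictionary,
ToyDictionary and tokenize are hypothetical, the words, context ids and
costs are made up, and the search is a simplified Viterbi with no
unknown-word handling and no end-of-sentence connection cost. The only
point is that the lattice search sees the dictionary through a small
interface, so another dictionary can be plugged in without rebuilding
the tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

/** Hypothetical sketch; none of these types are Lucene or Kuromoji APIs. */
public class DecoupledViterbiSketch {

  /** One dictionary entry: surface form, context ids, and word cost. */
  static final class Entry {
    final String surface;
    final int leftId, rightId, wordCost;
    Entry(String surface, int leftId, int rightId, int wordCost) {
      this.surface = surface; this.leftId = leftId;
      this.rightId = rightId; this.wordCost = wordCost;
    }
  }

  /** Everything the lattice search needs to know about a dictionary. */
  interface Dictionary {
    /** Entries whose surface form occurs in text starting at offset. */
    List<Entry> lookup(String text, int offset);
    /** Cost of a word with nextLeftId directly following a word with prevRightId. */
    int connectionCost(int prevRightId, int nextLeftId);
  }

  /** Toy in-memory dictionary; a real one would be loaded from external files. */
  static final class ToyDictionary implements Dictionary {
    private final List<Entry> entries;
    private final int[][] matrix; // indexed as [prevRightId][nextLeftId]
    ToyDictionary(List<Entry> entries, int[][] matrix) {
      this.entries = entries; this.matrix = matrix;
    }
    @Override public List<Entry> lookup(String text, int offset) {
      List<Entry> hits = new ArrayList<Entry>();
      for (Entry e : entries) if (text.startsWith(e.surface, offset)) hits.add(e);
      return hits;
    }
    @Override public int connectionCost(int prevRightId, int nextLeftId) {
      return matrix[prevRightId][nextLeftId];
    }
  }

  /** Lattice node: entry (null for the start node), cheapest cost so far, back pointer. */
  private static final class Node {
    final Entry entry; final int cost; final Node prev;
    Node(Entry entry, int cost, Node prev) {
      this.entry = entry; this.cost = cost; this.prev = prev;
    }
  }

  /** Simplified Viterbi search; it only talks to the Dictionary interface. */
  static List<String> tokenize(String text, Dictionary dict) {
    int n = text.length();
    List<List<Node>> endingAt = new ArrayList<List<Node>>();
    for (int i = 0; i <= n; i++) endingAt.add(new ArrayList<Node>());
    endingAt.get(0).add(new Node(null, 0, null)); // start node, context id 0
    for (int start = 0; start < n; start++) {
      if (endingAt.get(start).isEmpty()) continue; // no path reaches this offset
      for (Entry e : dict.lookup(text, start)) {
        Node best = null;
        int bestCost = Integer.MAX_VALUE;
        for (Node prev : endingAt.get(start)) {
          int prevRightId = prev.entry == null ? 0 : prev.entry.rightId;
          int cost = prev.cost + dict.connectionCost(prevRightId, e.leftId) + e.wordCost;
          if (cost < bestCost) { bestCost = cost; best = prev; }
        }
        endingAt.get(start + e.surface.length()).add(new Node(e, bestCost, best));
      }
    }
    Node last = null; // cheapest node that ends exactly at the end of the text
    for (Node node : endingAt.get(n)) {
      if (last == null || node.cost < last.cost) last = node;
    }
    if (last == null) throw new IllegalArgumentException("no segmentation found");
    LinkedList<String> tokens = new LinkedList<String>();
    for (Node node = last; node.entry != null; node = node.prev) tokens.addFirst(node.entry.surface);
    return tokens;
  }

  public static void main(String[] args) {
    // Made-up entries: context id 1 = noun, 2 = suffix; 0 is the start node.
    List<Entry> words = Arrays.asList(
        new Entry("tokyo", 1, 1, 3000),
        new Entry("kyoto", 1, 1, 3000),
        new Entry("to", 2, 2, 4000));
    // Dictionary a penalizes a suffix at the start of the text.
    Dictionary a = new ToyDictionary(words, new int[][] {{0, 0, 2000}, {0, 0, 0}, {0, 0, 0}});
    // Dictionary b (a "re-trained" matrix) penalizes a suffix right after a noun.
    Dictionary b = new ToyDictionary(words, new int[][] {{0, 0, 0}, {0, 0, 5000}, {0, 0, 0}});
    System.out.println(tokenize("tokyoto", a)); // [tokyo, to]
    System.out.println(tokenize("tokyoto", b)); // [to, kyoto]
  }
}
```

The two toy dictionaries share the same words and differ only in the
connection matrix, yet they split "tokyoto" differently, and tokenize
is the same code in both cases.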

It will take time (and to be honest I'm not sure the patch will be
accepted), but I think it's much better than applying ad hoc fixes to
the current build script.
If you are seriously interested in this work, please feel free to get
involved.

Tomoko

On Tue, May 28, 2019 at 7:57, Tomoko Uchida <tomoko.uchida.1111@gmail.com> wrote:
>
> Hi Namgyu,
>
> > There is a team that uses a well-ported system dictionary.
> > Then the Lucene version goes up (say, 8.1 -> 8.2).
> > Suppose there was no modification to kuromoji in 8.2.
> > The user still has to port the dictionary again.
> > The same goes for 8.2 to 8.3.
>
> I'm not sure about the situation in Korea, but we also have some
> frequently updated system dictionaries that are well maintained by
> NLP professionals:
> 1. neologd (a MeCab IPADIC extension) and 2. Sudachi (a UniDic
> extension partially including neologd), both of which I mentioned in
> my previous mail.
> I agree that it's laborious to rebuild the tokenizer every time you
> upgrade.
>
> In both cases, some outstanding contributors build and distribute
> plugins that include an up-to-date dictionary at a steady pace, and
> other users just use them. This seems to work well, at least in Japan,
> for now.
> Maybe we can start outside of the Lucene project like that? If
> the workflow works well and it's really needed, developers can propose
> the change in Jira anytime (a patch for the build script, and possibly
> also a policy for operating or updating the system dictionary).
>
> I know that the current JapaneseAnalyzer system dictionary (MeCab
> IPADIC) has not been maintained for ten years, and developers/users
> often complain about it.
> For now I just see the developer community (including me) making an
> effort to find good solutions for that.
>
> Thanks,
> Tomoko
>
> On Tue, May 28, 2019 at 2:42, Namgyu Kim <kng0828@gmail.com> wrote:
> >
> > Thank you for your reply, Tomoko :D
> >
> > To be honest, I have not experienced it directly (that is, in a
> > commercial product), so I can't speak to the exact situation of
> > Japanese MeCab.
> > I respect your opinion, and it is true that customization is a difficult
> > task.
> >
> > But I can talk a little bit about Korean MeCab. (The basic logic is the
> > same.)
> > In the case of Hangul MeCab, system dictionary changes are very frequent.
> > Developers do not design the engine from scratch, so they tend to try a
> > lot of tuning at some level (such as a custom model, score matrix, or
> > custom dictionary).
> > Especially in commercial use, developers do a lot of tuning to make
> > the dictionary best suited to their purpose.
> > (Of course, the big tech companies use their own analyzers :D)
> >
> > MeCab is especially popular in Korea, so there are many such attempts.
> > Developers often port it to Elasticsearch and use it heavily, but they
> > have to do a lot of boring work every time.
> > (This is not the Korean MeCab case, but I think Mike and Trejkaz were
> > talking in that sense.)
> >
> > There is another bad case.
> >
> > There is a team that uses a well-ported system dictionary.
> > Then the Lucene version goes up (say, 8.1 -> 8.2).
> > Suppose there was no modification to kuromoji in 8.2.
> > The user still has to port the dictionary again.
> > The same goes for 8.2 to 8.3.
> > Even if kuromoji only has a fix that is unrelated to the dictionary, the
> > user has to port it each time.
> >
> > If we at least allowed users to load custom dat files, these problems
> > would disappear.
> >
> > Warm regards,
> > Namgyu Kim
> >
> > On Mon, May 27, 2019 at 8:21 AM Tomoko Uchida <tomoko.uchida.1111@gmail.com>
> > wrote:
> >
> > > > Anyway, in my personal opinion, Lucene does not need to consider whether
> > > > the system dictionary status is good or not.
> > >
> > > Please don't get me wrong, but I don't think so.
> > > Creating a customized or re-trained system dictionary still needs deep
> > > knowledge of the language and of machine learning. Even among us
> > > native Japanese speakers, very few people can do so.
> > > The system dictionary is a key component of tokenization, so a badly
> > > customized system dictionary directly affects search quality,
> > > and I think we should prevent that. Instead of messing up the system
> > > dictionary without sufficient knowledge, please use the user
> > > dictionary. That is the reason it exists.
> > >
> > > Anyway, to build the system dictionary (MeCab IPADIC extensions), you
> > > do not need to read or fix the DictionaryBuilder class.
> > > Just modify analysis/kuromoji/build.xml to use the
> > > customized/re-trained dictionary (tarball).
> > >
> > > Tomoko
> > >
> > > On Mon, May 27, 2019 at 1:48, Namgyu Kim <kng0828@gmail.com> wrote:
> > > >
> > > > Oh, I think my explanation was not sufficient. Sorry...
> > > >
> > > > I listed the following steps.
> > > > =============================
> > > > 1. Modify your dictionary file and rebuild.
> > > >   1-1) Install MeCab
> > > >   1-2) Install MeCab Dictionary
> > > >   1-3) Modify your dictionary file
> > > >   1-4) Pack it into a tar.gz
> > > > =============================
> > > > Step "1-3)" does not mean that the user modifies the csv files and
> > > > compresses them back to tar.gz.
> > > > It means re-training; of course, the user has to be careful and have
> > > > knowledge of natural language processing.
> > > > Columns 2, 3 and 4 of the csv files are values produced by training.
> > > > (2 : left context id, 3 : right context id, 4 : cost)
> > > > These values depend on the model and the matrix.def values (when
> > > > using mecab-dict-index).
> > > >
> > > > That's why I mentioned "1-1)" and "1-2)" processes first.
> > > >
> > > > Anyway, in my personal opinion, Lucene does not need to consider whether
> > > > the system dictionary status is good or not.
> > > > I just think that when a user wants to use a custom system dictionary,
> > > > it is not user-friendly to have to modify the ant file or dig through
> > > > the code for a long time to run the DictionaryBuilder.
> > > > I think there should be at least a guide.
> > > >
> > > > Warm regards,
> > > > Namgyu Kim
> > > >
> > > > P.S. Although not as good as Tomoko's list, here is a list of
> > > > dictionaries supported by kuromoji.
> > > > (https://github.com/atilika/kuromoji#supported-dictionaries)
> > > >
> > > >
> > > > On Mon, May 27, 2019 at 12:12 AM, Tomoko Uchida <tomoko.uchida.1111@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > The system dictionary is not a mere "word collection"; it includes a
> > > > > machine-learned language model which is carefully trained by
> > > > > researchers. If you want to replace the system dictionary, you have to
> > > > > start by re-training the model. This needs expert knowledge, so I do
> > > > > not recommend just modifying the CSVs and rebuilding it (unless you
> > > > > have an expert on it).
> > > > >
> > > > > As for "modern words" that are not included in the current
> > > > > system dictionary, there are already a few options.
> > > > >
> > > > > 1. Use neologd dictionary (it's an extension of MeCab IPADIC,
> > > > > Kuromoji's default dictionary)
> > > > >
> > > > > For Solr:
> > > > > https://github.com/mocobeta/lucene-solr/tree/kuromoji-neologd_5_4_0
> > > > > (The branch is mine. A little bit old, but you can cherry-pick the
> > > > > changes in the kuromoji's build.xml.)
> > > > >
> > > > > For Elasticsearch:
> > > > > https://github.com/codelibs/elasticsearch-analysis-kuromoji-ipadic-neologd
> > > > >
> > > > > 2. Use Sudachi dictionary
> > > > >
> > > > > For Elasticsearch:
> > > > > https://github.com/WorksApplications/elasticsearch-sudachi
> > > > > This includes the Lucene jar, so I think you can extract the jar for
> > > > > Solr (I've never tried to use it with Solr).
> > > > >
> > > > > Both are actively maintained by linguistics & NLP researchers/engineers.
> > > > > Please be careful, those are rather huge jars...
> > > > >
> > > > > Hope that helps.
> > > > >
> > > > > Tomoko
> > > > >
> > > > > On Sun, May 26, 2019 at 23:11, Trejkaz <trejkaz@trypticon.org> wrote:
> > > > > >
> > > > > > On Sun, 26 May 2019 at 23:49, Namgyu Kim <kng0828@gmail.com> wrote:
> > > > > >
> > > > > > > I think so about that approach.
> > > > > > > It's not user-friendly and it is not good for the user.
> > > > > > > I think it's better to get the parameters in JapaneseTokenizer.
> > > > > > >
> > > > > > > What do you think about this?
> > > > > >
> > > > > >
> > > > > > A way to override the system dictionary would be useful for us as
> > > > > > well. We often get people complaining that the current dictionary is
> > > > > > missing a lot of common modern words, and there are alternate mecab
> > > > > > dictionaries sitting around already which solve this problem.
> > > > > >
> > > > > > TX
> > > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > >
> > > > >
> > >

