lucene-java-user mailing list archives

From Namgyu Kim <kng0...@gmail.com>
Subject Re: JapaneseAnalyzer's system vs user dict
Date Tue, 28 May 2019 15:46:18 GMT
Hi Tomoko :D

Thank you for your reply and for listening to my thoughts.
I didn't know this was an old question.
Of course, I'd like to participate in the LUCENE-8816 issue.

I think this issue will take some time.
I'll take a look at it.

Warm regards,
Namgyu Kim


On Tue, May 28, 2019 at 10:43 PM Tomoko Uchida <tomoko.uchida.1111@gmail.com>
wrote:

> Hi guys,
>
> I just created an issue related to this thread.
>
> Decouple Kuromoji's morphological analyser and its dictionary
> https://issues.apache.org/jira/browse/LUCENE-8816
>
> The problem discussed here is essentially rooted in the current
> architecture of Kuromoji (and Nori): the system dictionary is bundled
> in the jar.
> So, the most natural solution is to decouple the Viterbi logic from the
> encoded dictionary (just as traditional Japanese morphological
> analysis engines do).
> This is actually an old question with respect to Kuromoji; however, I
> feel that it's a good time to rethink it.
>
> It will take time (and to be honest I'm not sure the patch will be
> accepted), but I think it's much better than applying ad hoc fixes to
> the current build script.
> If you are seriously interested in this work, please feel free to get
> involved.
>
> Tomoko
>
> On Tue, May 28, 2019 at 7:57, Tomoko Uchida <tomoko.uchida.1111@gmail.com> wrote:
> >
> > Hi Namgyu,
> >
> > > There is a team that uses a well-ported system dictionary.
> > > The Lucene version is up. (like 8.1 -> 8.2)
> > > Suppose there was no modification to kuromoji in 8.2.
> > > But the user has to port again.
> > > The same goes for 8.2 to 8.3.
> >
> > I'm not sure about the situation in Korea; however, we also have some
> > frequently updated system dictionaries that are well maintained by NLP
> > professionals: the two I mentioned in my previous mail,
> > 1. neologd (a MeCab IPADIC extension) and 2. Sudachi (a UniDic
> > extension partially including neologd).
> > I agree that it's laborious to rebuild the tokenizer every time you
> > upgrade.
> >
> > In both cases, some outstanding contributors build and distribute
> > plugins that include an up-to-date dictionary at a constant pace, and
> > other users just use them. This seems to work well, at least in Japan,
> > for now.
> > Maybe we can start outside the Lucene project in the same way? If
> > the workflow works well and it's really needed, developers can propose
> > the change (a patch for the build script, and possibly also a policy
> > for operating or updating the system dictionary) in Jira at any time.
> >
> > I know that the current JapaneseAnalyzer's system dictionary (MeCab
> > IPADIC) has not been maintained for ten years and that developers/users
> > often complain about it.
> > For now, I just see the developer community (including me) making an
> > effort to find good solutions for that.
> >
> > Thanks,
> > Tomoko
> >
> > On Tue, May 28, 2019 at 2:42, Namgyu Kim <kng0828@gmail.com> wrote:
> > >
> > > Thank you for your reply, Tomoko :D
> > >
> > > To be honest, I have not experienced it directly (that is, in a
> > > commercial setting), so I can't speak to the exact situation of
> > > Japanese MeCab.
> > > I respect your opinion, and it is true that customization is a
> > > difficult task.
> > >
> > > But I can talk a little bit about Korean MeCab. (The basic logic is
> > > the same.)
> > > In the case of Hangul MeCab, system dictionary changes are very
> > > frequent.
> > > Developers do not design the engine from scratch, so they tend to try
> > > a lot of tuning at some level (like a custom model, score matrix, or
> > > custom dictionary).
> > > Especially in commercial use, developers do a lot of tuning to build
> > > the dictionary that is most suitable for their purpose.
> > > (Of course, the big tech companies use their own analyzers :D)
> > >
> > > MeCab is especially popular in Korea, so there are many such attempts.
> > > Developers often port it to Elasticsearch and use it a lot, but they
> > > have to do a lot of boring work every time.
> > > (It is not the Korean MeCab case, but I think Mike and Trejkaz were
> > > talking in that sense.)
> > >
> > > There is another bad case.
> > >
> > > There is a team that uses a well-ported system dictionary.
> > > The Lucene version is up. (like 8.1 -> 8.2)
> > > Suppose there was no modification to kuromoji in 8.2.
> > > But the user has to port again.
> > > The same goes for 8.2 to 8.3.
> > > Even if kuromoji has a fix that is not related to the dictionary, the
> > > user has to port it each time.
> > >
> > > If we at least allowed them to read custom dat files, these problems
> > > would disappear.
> > >
> > > Warm regards,
> > > Namgyu Kim
> > >
> > > On Mon, May 27, 2019 at 8:21 AM Tomoko Uchida
> > > <tomoko.uchida.1111@gmail.com> wrote:
> > >
> > > > > Anyway, in my personal opinion, Lucene does not need to consider
> > > > > whether the system dictionary status is good or not.
> > > >
> > > > Please don't get me wrong, but I don't think so.
> > > > Creating a customized or re-trained system dictionary still needs
> > > > deep knowledge about language and machine learning. Even among us
> > > > native Japanese speakers, very few people can do so.
> > > > The system dictionary is a key component for tokenization, so a
> > > > badly customized system dictionary directly affects search quality,
> > > > and I think we should prevent that. Instead of messing up the system
> > > > dictionary without sufficient knowledge, please use the user
> > > > dictionary. That is the reason why it exists.
> > > >
> > > > Anyway, to build the system dictionary (MeCab IPADIC extensions), you
> > > > do not need to read or fix the DictionaryBuilder class.
> > > > Just modify analysis/kuromoji/build.xml to use the
> > > > customized/re-trained dictionary (tarball).
> > > >
> > > > Tomoko
> > > >
> > > > On Mon, May 27, 2019 at 1:48, Namgyu Kim <kng0828@gmail.com> wrote:
> > > > >
> > > > > Oh, I think my explanation was not enough. Sorry...
> > > > >
> > > > > I mentioned the following steps.
> > > > > =============================
> > > > > 1. Modify your dictionary file and rebuild.
> > > > >   1-1) Install MeCab
> > > > >   1-2) Install MeCab Dictionary
> > > > >   1-3) Modify your dictionary file
> > > > >   1-4) Make it to tar.gz
> > > > > =============================
> > > > > Step "1-3)" does not mean that the user modifies the csv files and
> > > > > compresses them back to tar.gz.
> > > > > It means re-training; of course, the user has to be careful and have
> > > > > knowledge of Natural Language Processing.
> > > > > Columns 2, 3 and 4 in the csv are values produced by training.
> > > > > (2: left context id, 3: right context id, 4: cost)
> > > > > These values depend on the model and the matrix.def values (when
> > > > > using mecab-dict-index).
> > > > >
> > > > > That's why I mentioned "1-1)" and "1-2)" processes first.
> > > > >
> > > > > Anyway, in my personal opinion, Lucene does not need to consider
> > > > > whether the system dictionary status is good or not.
> > > > > I just think that when a user wants to use a custom system
> > > > > dictionary, it is not user-friendly to have to modify the ant file
> > > > > or spend a long time hunting for the code that runs the
> > > > > DictionaryBuilder.
> > > > > I think there should be at least a guide.
> > > > >
> > > > > Warm regards,
> > > > > Namgyu Kim
> > > > >
> > > > > P.S. Although it is not as good as Tomoko's list, there is a list of
> > > > > dictionaries supported by kuromoji.
> > > > > (https://github.com/atilika/kuromoji#supported-dictionaries)
> > > > >
> > > > >
> > > > > On Mon, May 27, 2019 at 12:12 AM, Tomoko Uchida
> > > > > <tomoko.uchida.1111@gmail.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > The system dictionary is not a mere "word collection"; it includes
> > > > > > a machine-learned language model which is carefully trained by
> > > > > > researchers. If you want to replace the system dictionary, you have
> > > > > > to start by re-training the model. This needs expert knowledge, so
> > > > > > I do not recommend just modifying the CSVs and rebuilding it (unless
> > > > > > you have an expert on it).
> > > > > >
> > > > > > As for "modern words" that are not included in the current
> > > > > > system dictionary, there are already a few options.
> > > > > >
> > > > > > 1. Use neologd dictionary (it's an extension of MeCab IPADIC,
> > > > > > Kuromoji's default dictionary)
> > > > > >
> > > > > > For Solr:
> > > > > > https://github.com/mocobeta/lucene-solr/tree/kuromoji-neologd_5_4_0
> > > > > > (The branch is mine. A little bit old, but you can cherry-pick the
> > > > > > changes in the kuromoji's build.xml.)
> > > > > >
> > > > > > For Elasticsearch:
> > > > > > https://github.com/codelibs/elasticsearch-analysis-kuromoji-ipadic-neologd
> > > > > >
> > > > > > 2. Use Sudachi dictionary
> > > > > >
> > > > > > For Elasticsearch:
> > > > > > https://github.com/WorksApplications/elasticsearch-sudachi
> > > > > > This includes a Lucene jar, so I think you can extract the jar for
> > > > > > Solr (I've never tried to use it with Solr).
> > > > > >
> > > > > > Both are actively maintained by linguistics & NLP
> > > > > > researchers/engineers.
> > > > > > Please be careful, those are rather huge jars...
> > > > > >
> > > > > > Hope that helps.
> > > > > >
> > > > > > Tomoko
> > > > > >
> > > > > > On Sun, May 26, 2019 at 23:11, Trejkaz <trejkaz@trypticon.org> wrote:
> > > > > > >
> > > > > > > On Sun, 26 May 2019 at 23:49, Namgyu Kim <kng0828@gmail.com> wrote:
> > > > > > >
> > > > > > > > I think so about that approach.
> > > > > > > > It's not user-friendly and it is not good for the user.
> > > > > > > > I think it's better to get the parameters in JapaneseTokenizer.
> > > > > > > >
> > > > > > > > What do you think about this?
> > > > > > >
> > > > > > >
> > > > > > > A way to override the system dictionary would be useful for us as
> > > > > > > well. We often get people complaining that the current dictionary
> > > > > > > is missing a lot of common modern words, and there are alternate
> > > > > > > mecab dictionaries sitting around already which solve this problem.
> > > > > > >
> > > > > > > TX
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > > >
> > > > > >
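For readers following the CSV discussion in the quoted thread (columns 2, 3 and 4 being the left context id, right context id and cost), here is a minimal Java sketch of reading those values from one dictionary line. It is only an illustration under stated assumptions: the class name `MecabCsvEntry` and the sample line are hypothetical, column 1 is taken to be the surface form, and the naive comma split does not handle quoted fields that contain commas. It is not part of Lucene or MeCab.

```java
/**
 * Hypothetical holder for the first four columns of a MeCab-style
 * dictionary CSV line: surface form, left context id, right context id, cost.
 * Illustration only; not a Lucene or MeCab class.
 */
public final class MecabCsvEntry {
    public final String surface;
    public final int leftContextId;
    public final int rightContextId;
    public final int cost;

    private MecabCsvEntry(String surface, int leftContextId, int rightContextId, int cost) {
        this.surface = surface;
        this.leftContextId = leftContextId;
        this.rightContextId = rightContextId;
        this.cost = cost;
    }

    /** Parses one CSV line with a naive split (assumption: no quoted commas). */
    public static MecabCsvEntry parse(String line) {
        String[] cols = line.split(",", -1);
        if (cols.length < 4) {
            throw new IllegalArgumentException("expected at least 4 columns: " + line);
        }
        return new MecabCsvEntry(
                cols[0],
                Integer.parseInt(cols[1].trim()),  // column 2: left context id
                Integer.parseInt(cols[2].trim()),  // column 3: right context id
                Integer.parseInt(cols[3].trim())); // column 4: cost
    }

    public static void main(String[] args) {
        // Made-up sample line; real entries carry further feature columns.
        MecabCsvEntry e = parse("example,10,20,3000,noun");
        System.out.println(e.surface + " left=" + e.leftContextId
                + " right=" + e.rightContextId + " cost=" + e.cost);
    }
}
```

Because these three numbers come out of training against a particular model and matrix.def, editing them by hand, or adding rows with guessed values, is exactly the kind of change the thread warns against.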
