lucene-java-user mailing list archives

From Tomoko Uchida <tomoko.uchida.1...@gmail.com>
Subject Re: JapaneseAnalyzer's system vs user dict
Date Sun, 26 May 2019 23:21:00 GMT
> Anyway, in my personal opinion, Lucene does not need to consider whether
the system dictionary status is good or not.

Please don't get me wrong, but I don't think so.
Creating a customized or re-trained system dictionary still requires deep
knowledge of the language and of machine learning. Even among us native
Japanese speakers, very few people can do it.
The system dictionary is a key component of tokenization, so a badly
customized system dictionary directly degrades search quality, and I
think we should prevent that. Instead of modifying the system
dictionary without sufficient knowledge, please use the user
dictionary. That is the reason why it exists.
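
As a rough illustration of what a user dictionary entry looks like, here is a minimal Java sketch. It assumes the four-column line format used by Kuromoji user dictionaries (surface form, space-separated segmentation, space-separated readings, part-of-speech tag). The class UserDictLine and its two sanity checks are hypothetical helpers written for this example; they are not part of Lucene.

```java
/** Hypothetical helper (not part of Lucene) that builds one user dictionary line. */
final class UserDictLine {

    /**
     * Formats "surface,segmentation,readings,part-of-speech".
     * Segments and readings are space-separated inside their columns.
     */
    static String format(String surface, String[] segments, String[] readings, String pos) {
        // Sanity check (assumption of this sketch): one reading per segment.
        if (segments.length != readings.length) {
            throw new IllegalArgumentException(
                "segments and readings differ in length for: " + surface);
        }
        // Sanity check (assumption of this sketch): segments must spell out the surface form.
        if (!String.join("", segments).equals(surface)) {
            throw new IllegalArgumentException(
                "segments do not concatenate to the surface form: " + surface);
        }
        return surface + "," + String.join(" ", segments) + ","
            + String.join(" ", readings) + "," + pos;
    }

    public static void main(String[] args) {
        // Prints one line that could be appended to a user dictionary file.
        System.out.println(format(
            "関西国際空港",
            new String[] {"関西", "国際", "空港"},
            new String[] {"カンサイ", "コクサイ", "クウコウ"},
            "カスタム名詞"));
    }
}
```

A file of such lines can then be loaded with UserDictionary.open(Reader) and passed to JapaneseTokenizer or JapaneseAnalyzer, which leaves the trained system dictionary untouched.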

Anyway, to build the system dictionary (MeCab IPADIC extensions), you
do not need to read or fix the DictionaryBuilder class.
Just modify analysis/kuromoji/build.xml to use the
customized/re-trained dictionary (tarball).

Tomoko

On Mon, 27 May 2019 at 1:48, Namgyu Kim <kng0828@gmail.com> wrote:
>
> Oh, I think my explanation was not enough. Sorry...
>
> I wrote the following steps.
> =============================
> 1. Modify your dictionary file and rebuild.
>   1-1) Install MeCab
>   1-2) Install MeCab Dictionary
>   1-3) Modify your dictionary file
>   1-4) Package it as tar.gz
> =============================
> Step "1-3)" does not mean that the user edits the CSV files and compresses
> them back into a tar.gz.
> It means re-training; of course, the user has to be careful and needs
> knowledge of natural language processing.
> Columns 2, 3 and 4 of each CSV row are values produced by training
> (2: left context id, 3: right context id, 4: cost).
> These values depend on the model and on the matrix.def values (when using
> mecab-dict-index).
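
To make the column layout described here concrete, the following is a minimal Java sketch that reads one MeCab IPADIC-style CSV row and pulls out columns 2 to 4 (counting from 1). The class DictEntry is a hypothetical helper, not part of Lucene or MeCab, and the sample row is only illustrative. The parser is a naive comma split, so it assumes no field contains a quoted comma.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical, minimal view of one MeCab IPADIC-style CSV row. */
final class DictEntry {
    final String surface;        // column 1: surface form
    final int leftId;            // column 2: left context id (produced by training)
    final int rightId;           // column 3: right context id (produced by training)
    final int cost;              // column 4: word cost (produced by training)
    final List<String> features; // remaining columns: part of speech, reading, etc.

    private DictEntry(String surface, int leftId, int rightId, int cost, List<String> features) {
        this.surface = surface;
        this.leftId = leftId;
        this.rightId = rightId;
        this.cost = cost;
        this.features = features;
    }

    /** Naive parser: splits on every comma and does not handle quoted fields. */
    static DictEntry parse(String line) {
        String[] cols = line.split(",", -1);
        if (cols.length < 4) {
            throw new IllegalArgumentException("expected at least 4 columns: " + line);
        }
        List<String> rest = new ArrayList<>(Arrays.asList(cols).subList(4, cols.length));
        return new DictEntry(
            cols[0],
            Integer.parseInt(cols[1].trim()),
            Integer.parseInt(cols[2].trim()),
            Integer.parseInt(cols[3].trim()),
            Collections.unmodifiableList(rest));
    }

    public static void main(String[] args) {
        // Illustrative row; the numeric values only make sense together with
        // the matrix.def of the model they were trained with.
        DictEntry e = parse("東京,1293,1293,3003,名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー");
        System.out.println(e.surface + " left=" + e.leftId + " right=" + e.rightId + " cost=" + e.cost);
    }
}
```

Because the context ids index into the connection-cost matrix and the cost is tuned against it, hand-editing these three numbers without re-training is what this thread warns against.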
>
> That's why I listed steps "1-1)" and "1-2)" first.
>
> Anyway, in my personal opinion, Lucene does not need to consider whether
> the system dictionary status is good or not.
> I just think that when a user wants to use a custom system dictionary, it is
> not user-friendly to have to modify the Ant file or dig through the code for
> a long time to run the DictionaryBuilder.
> I think there should at least be a guide.
>
> Warm regards,
> Namgyu Kim
>
> P.S. Although it is not as good as Tomoko's list, there is a list of
> dictionaries supported by kuromoji.
> (https://github.com/atilika/kuromoji#supported-dictionaries)
>
>
> On Mon, 27 May 2019 at 12:12 AM, Tomoko Uchida <tomoko.uchida.1111@gmail.com>
> wrote:
>
> > Hi,
> >
> > The system dictionary is not a mere "word collection"; it includes a
> > machine-learned language model that was carefully trained by
> > researchers. If you want to replace the system dictionary, you have to
> > start by re-training the model. This requires expert knowledge, so I do
> > not recommend just modifying the CSVs and rebuilding it (unless you
> > have an expert on hand).
> >
> > As for "modern words" that are not included in the current
> > system dictionary, there are already a few options.
> >
> > 1. Use neologd dictionary (it's an extension of MeCab IPADIC,
> > Kuromoji's default dictionary)
> >
> > For Solr:
> > https://github.com/mocobeta/lucene-solr/tree/kuromoji-neologd_5_4_0
> > (The branch is mine. A little bit old, but you can cherry-pick the
> > changes in the kuromoji's build.xml.)
> >
> > For Elasticsearch:
> > https://github.com/codelibs/elasticsearch-analysis-kuromoji-ipadic-neologd
> >
> > 2. Use Sudachi dictionary
> >
> > For Elasticsearch:
> > https://github.com/WorksApplications/elasticsearch-sudachi
> > This includes a Lucene jar, so I think you can extract the jar for Solr
> > (I've never tried using it with Solr).
> >
> > Both are actively maintained by linguistics & NLP researchers/engineers.
> > Please be careful, those are rather huge jars...
> >
> > Hope that helps.
> >
> > Tomoko
> >
> > On Sun, 26 May 2019 at 23:11, Trejkaz <trejkaz@trypticon.org> wrote:
> > >
> > > On Sun, 26 May 2019 at 23:49, Namgyu Kim <kng0828@gmail.com> wrote:
> > >
> > > > I think so about that approach.
> > > > It's not user-friendly and it is not good for the user.
> > > > I think it's better to get the parameters in JapaneseTokenizer.
> > > >
> > > > What do you think about this?
> > >
> > >
> > > A way to override the system dictionary would be useful for us as well.
> > > We often get people complaining that the current dictionary is missing
> > > a lot of common modern words, and there are alternate mecab dictionaries
> > > sitting around already which solve this problem.
> > >
> > > TX
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
