lucene-java-user mailing list archives

From Tomoko Uchida <tomoko.uchida.1...@gmail.com>
Subject Re: JapaneseAnalyzer's system vs user dict
Date Sun, 26 May 2019 15:11:36 GMT
Hi,

The system dictionary is not a mere "word collection"; it includes a
machine-learned language model that was carefully trained by
researchers. If you want to replace the system dictionary, you have to
start by re-training the model. This requires expert knowledge, so I do
not recommend just modifying the CSVs and rebuilding it (unless you
have an expert on hand).

As for "modern words" that are not included in the current system
dictionary, there are already a few options.

1. Use the neologd dictionary (it's an extension of MeCab IPADIC,
Kuromoji's default dictionary)

For Solr: https://github.com/mocobeta/lucene-solr/tree/kuromoji-neologd_5_4_0
(The branch is mine. It is a little old, but you can cherry-pick the
changes to Kuromoji's build.xml.)

For Elasticsearch:
https://github.com/codelibs/elasticsearch-analysis-kuromoji-ipadic-neologd

2. Use the Sudachi dictionary

For Elasticsearch: https://github.com/WorksApplications/elasticsearch-sudachi
This includes a Lucene jar, so I think you can extract the jar for Solr
(I have never tried using it with Solr).

Both are actively maintained by linguistics & NLP researchers/engineers.
Be aware that those are rather large jars.
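
If only a handful of words are missing, the user dictionary named in the
subject may also be enough, and it leaves the system dictionary and its
trained model untouched. As a rough sketch, assuming the usual Kuromoji
user dictionary line format (surface form, space-separated segmentation,
space-separated readings, part-of-speech tag), such lines could be
generated as below. UserDictLines and its checks are a hypothetical
helper, not part of Lucene, and fields containing commas are simply
rejected here for brevity:

```java
import java.util.List;

/** Hypothetical helper (not part of Lucene) that builds user dictionary lines. */
final class UserDictLines {

    /** Formats one entry as: surface,segmentation,readings,part-of-speech */
    static String format(String surface, List<String> segments,
                         List<String> readings, String pos) {
        // Each segment needs exactly one reading.
        if (segments.isEmpty() || segments.size() != readings.size()) {
            throw new IllegalArgumentException(
                "need one reading per segment: " + surface);
        }
        // Sanity check chosen for this sketch: the segments, concatenated,
        // should spell out the surface form.
        if (!String.join("", segments).equals(surface)) {
            throw new IllegalArgumentException(
                "segments do not add up to the surface form: " + surface);
        }
        String line = String.join(",",
            surface,
            String.join(" ", segments),
            String.join(" ", readings),
            pos);
        // Simplification: no CSV quoting, so any comma inside a field
        // would produce more than four columns and is rejected.
        if (line.chars().filter(c -> c == ',').count() != 3) {
            throw new IllegalArgumentException(
                "fields must not contain commas: " + surface);
        }
        return line;
    }

    public static void main(String[] args) {
        // Example entry: split one long compound into three segments.
        System.out.println(format(
            "関西国際空港",
            List.of("関西", "国際", "空港"),
            List.of("カンサイ", "コクサイ", "クウコウ"),
            "カスタム名詞"));
    }
}
```

The resulting lines would go into a text file that is passed to the
analyzer as its user dictionary (in Solr, for example, through the
userDictionary attribute of JapaneseTokenizerFactory).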

Hope that helps.

Tomoko

On Sun, 26 May 2019 at 23:11, Trejkaz <trejkaz@trypticon.org> wrote:
>
> On Sun, 26 May 2019 at 23:49, Namgyu Kim <kng0828@gmail.com> wrote:
>
> > I think so about that approach.
> > It's not user-friendly and it is not good for the user.
> > I think it's better to get the parameters in JapaneseTokenizer.
> >
> > What do you think about this?
>
>
> A way to override the system dictionary would be useful for us as well. We
> often get people complaining that the current dictionary is missing a lot
> of common modern words, and there are alternate mecab dictionaries sitting
> around already which solve this problem.
>
> TX


