lucene-java-user mailing list archives

From Tomoko Uchida <>
Subject Re: JapaneseAnalyzer's system vs user dict
Date Sun, 26 May 2019 15:11:36 GMT

The system dictionary is not a mere "word collection"; it includes a
machine-learned language model that was carefully trained by
researchers. If you want to replace the system dictionary, you have to
start by re-training the model. This requires expert knowledge, so I do
not recommend just modifying the CSVs and rebuilding it (unless you
have an expert on hand).
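
To illustrate why hand-editing costs is risky, here is a minimal sketch
of lattice-style segmentation in Python. It is not Kuromoji's code: the
function segment, the context IDs, and every cost below are made up for
illustration. The assumption is only that the tokenizer picks the path
with the lowest sum of word costs and connection costs, as
MeCab-style dictionaries are designed for.

```python
# Toy lattice segmentation. All names and numbers here are hypothetical;
# a real dictionary ships trained word costs and a trained connection-cost
# matrix between context IDs.

def segment(text, words, conn):
    """Return (total_cost, tokens) for the cheapest segmentation of text.

    words: surface form -> (context_id, word_cost)
    conn:  (previous_context_id, context_id) -> connection cost
    The end-of-sentence connection is omitted to keep the sketch short.
    """
    # best[i] maps the context ID of the last token ending at position i
    # to the cheapest (cost, tokens) reaching i with that context ID.
    best = [dict() for _ in range(len(text) + 1)]
    best[0]["BOS"] = (0, [])
    for start in range(len(text)):
        for prev_ctx, (prev_cost, prev_tokens) in best[start].items():
            for end in range(start + 1, len(text) + 1):
                surface = text[start:end]
                if surface not in words:
                    continue
                ctx, word_cost = words[surface]
                cost = prev_cost + conn.get((prev_ctx, ctx), 0) + word_cost
                if ctx not in best[end] or cost < best[end][ctx][0]:
                    best[end][ctx] = (cost, prev_tokens + [surface])
    if not best[len(text)]:
        raise ValueError("no segmentation found for: " + text)
    return min(best[len(text)].values(), key=lambda entry: entry[0])


# Made-up dictionary entries and connection costs.
words = {
    "東京": ("NOUN", 3000),
    "都": ("SUFFIX", 4000),
    "東": ("NOUN", 5000),
    "京都": ("NOUN", 3000),
}
conn = {
    ("BOS", "NOUN"): 0,
    ("NOUN", "SUFFIX"): -1000,
    ("NOUN", "NOUN"): 1000,
}

print(segment("東京都", words, conn))   # (6000, ['東京', '都'])

# Lowering one word cost by hand flips the result for this input.
edited = dict(words)
edited["京都"] = ("NOUN", -1000)
print(segment("東京都", edited, conn))  # (5000, ['東', '京都'])
```

In this toy setup, changing a single cost changes how an unrelated
phrase is split, which is the kind of side effect that re-training the
model is meant to keep under control.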

As for "modern words" that are not included in the current system
dictionary, there are already a few options.

1. Use the neologd dictionary (an extension of MeCab IPADIC, which is
Kuromoji's default dictionary)

For Solr:
(The branch is mine. It is a little old, but you can cherry-pick the
changes to Kuromoji's build.xml.)

For Elasticsearch:

2. Use the Sudachi dictionary

For Elasticsearch:
This includes a Lucene jar, so I think you can extract the jar for use
with Solr (I have never tried it with Solr).

Both are actively maintained by linguistics and NLP researchers/engineers.
Be careful, though: those are rather huge jars...

Hope that helps.


On Sun, 26 May 2019 at 23:11, Trejkaz <> wrote:
> On Sun, 26 May 2019 at 23:49, Namgyu Kim <> wrote:
> > I think so about that approach.
> > It's not user-friendly and it is not good for the user.
> I think it's better to get the parameters in
> JapaneseTokenizer.
> >
> > What do you think about this?
> A way to override the system dictionary would be useful for us as well. We
> often get people complaining that the current dictionary is missing a lot
> of common modern words, and there are alternate mecab dictionaries sitting
> around already which solve this problem.
> TX
