lucene-java-user mailing list archives

From Namgyu Kim <kng0...@gmail.com>
Subject Re: JapaneseAnalyzer's system vs user dict
Date Sun, 26 May 2019 16:48:04 GMT
Oh, I think my explanation was not clear enough. Sorry...

This is what I wrote earlier:
=============================
1. Modify your dictionary file and rebuild.
  1-1) Install MeCab
  1-2) Install MeCab Dictionary
  1-3) Modify your dictionary file
  1-4) Make it to tar.gz
=============================
Step "1-3)" does not mean that the user edits the CSV files and compresses
them back into a tar.gz.
It means re-training. Of course, the user has to be careful and needs some
knowledge of Natural Language Processing.
Columns 2, 3 and 4 of each CSV row are values produced by training.
(2: left context ID, 3: right context ID, 4: cost)
These values depend on the model and on the matrix.def values (when using
mecab-dict-index).
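
To make the column layout concrete, here is a minimal Python sketch that splits one MeCab-style dictionary CSV row into the surface form, the three trained values, and the remaining feature columns. The names `DictEntry` and `parse_entry` are hypothetical, and the numbers in the sample row are made up for illustration; they are not taken from a real dictionary.

```python
import csv
from dataclasses import dataclass
from typing import List


@dataclass
class DictEntry:
    surface: str         # column 1: the word as written
    left_id: int         # column 2: left context ID (produced by training)
    right_id: int        # column 3: right context ID (produced by training)
    cost: int            # column 4: word cost (produced by training)
    features: List[str]  # remaining columns: part of speech, reading, etc.


def parse_entry(line: str) -> DictEntry:
    """Parse a single CSV row of a MeCab-style dictionary file."""
    fields = next(csv.reader([line]))
    if len(fields) < 4:
        raise ValueError("expected at least 4 columns, got %d" % len(fields))
    surface, left_id, right_id, cost = fields[:4]
    return DictEntry(surface, int(left_id), int(right_id), int(cost), fields[4:])


# Illustrative row only: the IDs and the cost are invented for this example.
sample = "東京,1293,1293,3003,名詞,固有名詞,地域,一般,*,*,東京,トウキョウ,トーキョー"
entry = parse_entry(sample)
print(entry.surface, entry.left_id, entry.right_id, entry.cost)
```

Because the context IDs index into matrix.def and the cost comes from the trained model, hand-editing these three columns without re-training can leave a row inconsistent with the rest of the dictionary.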

That's why I listed steps "1-1)" and "1-2)" first.

Anyway, in my personal opinion, Lucene does not need to judge whether a
given system dictionary is good or not.
I just think that when a user wants to use a custom system dictionary, it is
not user-friendly to make them modify the ant file or dig through the code
for a long time just to run the DictionaryBuilder.
I think there should at least be a guide.

Warm regards,
Namgyu Kim

P.S. Although it is not as thorough as Tomoko's list, there is also a list
of dictionaries supported by kuromoji.
(https://github.com/atilika/kuromoji#supported-dictionaries)


On Mon, 27 May 2019 at 12:12 AM, Tomoko Uchida <tomoko.uchida.1111@gmail.com>
wrote:

> Hi,
>
> The system dictionary is not a mere "word collection", it includes a
> machine-learned language model which is carefully trained by
> researchers. If you want to replace the system dictionary, you have to
> start by re-training the model. This needs expert knowledge, so I do
> not recommend just modifying the CSVs and rebuilding it (unless you
> have an expert on it).
>
> As for "modern words" that are not included in the current
> system dictionary, there are already a few options.
>
> 1. Use neologd dictionary (it's an extension of MeCab IPADIC,
> Kuromoji's default dictionary)
>
> For Solr:
> https://github.com/mocobeta/lucene-solr/tree/kuromoji-neologd_5_4_0
> (The branch is mine. A little bit old, but you can cherry-pick the
> changes in the kuromoji's build.xml.)
>
> For Elasticsearch:
> https://github.com/codelibs/elasticsearch-analysis-kuromoji-ipadic-neologd
>
> 2. Use Sudachi dictionary
>
> For Elasticsearch:
> https://github.com/WorksApplications/elasticsearch-sudachi
> This includes Lucene jar, so I think you can extract the jar for Solr
> (I've never tried to use with Solr).
>
> Both are actively maintained by linguistics & NLP researchers/engineers.
> Please be careful, those are rather huge jars...
>
> Hope that helps.
>
> Tomoko
>
> On Sun, 26 May 2019 at 23:11, Trejkaz <trejkaz@trypticon.org> wrote:
> >
> > On Sun, 26 May 2019 at 23:49, Namgyu Kim <kng0828@gmail.com> wrote:
> >
> > > I think so about that approach.
> > > It's not user-friendly and it is not good for the user.
> > > I think it's better to get the parameters in JapaneseTokenizer.
> > >
> > > What do you think about this?
> >
> >
> > A way to override the system dictionary would be useful for us as well.
> > We often get people complaining that the current dictionary is missing a
> > lot of common modern words, and there are alternate mecab dictionaries
> > sitting around already which solve this problem.
> >
> > TX
> >
> >
> > >
> > >
> > >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
