lucene-dev mailing list archives

From "Kazuaki Hiraga (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-4056) Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
Date Wed, 24 Apr 2019 13:00:06 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825122#comment-16825122 ]

Kazuaki Hiraga edited comment on LUCENE-4056 at 4/24/19 1:00 PM:
-----------------------------------------------------------------

I agree with [~Tomoko Uchida]. I believe that UniDic is more suitable for Japanese full-text
information retrieval, since the dictionary is well maintained by researchers at a Japanese
government-funded institute and it applies stricter rules than the IPA dictionary, rules that
are intended to produce consistent tokenization results.


was (Author: h.kazuaki):
I agree with [~Tomoko Uchida] and I believe that UniDis is more suitable for Japanese full-text
information retrieval since the dictionary is well maintained by researchers of Japanese
government funded institute and applies stricter rules than IPAdictionary that intend to produce
consistent tokenization results. 

> Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary
> ------------------------------------------------------------
>
>                 Key: LUCENE-4056
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4056
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.6
>         Environment: Solr 3.6
> UniDic 1.3.12 for MeCab (unidic-mecab1312src.tar.gz)
>            Reporter: Kazuaki Hiraga
>            Priority: Major
>
> I tried to build a UniDic dictionary to use with Kuromoji on Solr 3.6. I think UniDic is a
> better dictionary than the IPA dictionary, so Kuromoji for Lucene/Solr should support the
> UniDic dictionary as standalone Kuromoji does.
> The following is my procedure:
> I modified build.xml under the lucene/contrib/analyzers/kuromoji directory and ran
> 'ant build-dict', which failed with the error below.
> build-dict:
>      [java] dictionary builder
>      [java] 
>      [java] dictionary format: UNIDIC
>      [java] input directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/build/contrib/analyzers/kuromoji/unidic-mecab1312src
>      [java] output directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/contrib/analyzers/kuromoji/src/resources
>      [java] input encoding: utf-8
>      [java] normalize entries: false
>      [java] 
>      [java] building tokeninfo dict...
>      [java]   parse...
>      [java]   sort...
>      [java] Exception in thread "main" java.lang.AssertionError
>      [java]   encode...
>      [java] 	at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:113)
>      [java] 	at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:141)
>      [java] 	at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
>      [java] 	at org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
>      [java] 	at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)
> And the diff of build.xml:
> ===================================================================
> --- build.xml	(revision 1338023)
> +++ build.xml	(working copy)
> @@ -28,19 +28,31 @@
>    <property name="maven.dist.dir" location="../../../dist/maven" />
>  
>    <!-- default configuration: uses mecab-ipadic -->
> +  <!--
>    <property name="ipadic.version" value="mecab-ipadic-2.7.0-20070801" />
>    <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
>    <property name="dict.url" value="http://mecab.googlecode.com/files/${dict.src.file}"/>
> +  -->
>  
>    <!-- alternative configuration: uses mecab-naist-jdic
>    <property name="ipadic.version" value="mecab-naist-jdic-0.6.3b-20111013" />
>    <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
>    <property name="dict.url" value="http://sourceforge.jp/frs/redir.php?m=iij&amp;f=/naist-jdic/53500/${dict.src.file}"/>
>    -->
> -  
> +
> +  <!-- alternative configuration: uses UniDic -->
> +  <property name="ipadic.version" value="unidic-mecab1312src" />
> +  <property name="dict.src.file" value="unidic-mecab1312src.tar.gz" />
> +  <property name="dict.loc.dir" value="/home/kazu/Work/src/nlp/unidic/_archive"/>
> +
>    <property name="dict.src.dir" value="${build.dir}/${ipadic.version}" />
> +  <!--
>    <property name="dict.encoding" value="euc-jp"/>
>    <property name="dict.format" value="ipadic"/>
> +  -->
> +  <property name="dict.encoding" value="utf-8"/>
> +  <property name="dict.format" value="unidic"/>
> +
>    <property name="dict.normalize" value="false"/>
>    <property name="dict.target.dir" location="./src/resources"/>
>  
> @@ -58,7 +70,8 @@
>  
>    <target name="compile-core" depends="jar-analyzers-common, common.compile-core" />
>    <target name="download-dict" unless="dict.available">
> -     <get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/>
> +     <!-- get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/ -->
> +     <copy file="${dict.loc.dir}/${dict.src.file}" tofile="${build.dir}/${dict.src.file}"/>
>       <gunzip src="${build.dir}/${dict.src.file}"/>
>       <untar src="${build.dir}/${ipadic.version}.tar" dest="${build.dir}"/>
>    </target>
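
For readers who want to reproduce the staging step outside Ant, the modified download-dict target above (copy a local archive, gunzip it, untar it) can be approximated with the sketch below. It is only an illustration under stated assumptions: the function name stage_local_dictionary and its arguments are hypothetical and are not part of the Lucene build, and it does not run the dictionary builder itself.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def stage_local_dictionary(archive: Path, build_dir: Path) -> Path:
    """Copy a local .tar.gz into build_dir, gunzip it, and untar it.

    Hypothetical helper mirroring the <copy>, <gunzip>, <untar> sequence
    in the diff; returns the path of the intermediate .tar file.
    """
    build_dir.mkdir(parents=True, exist_ok=True)

    # Equivalent of <copy file="..." tofile="${build.dir}/${dict.src.file}"/>
    staged = build_dir / archive.name
    shutil.copyfile(archive, staged)

    # Equivalent of <gunzip src="..."/>: name.tar.gz becomes name.tar
    tar_path = staged.with_suffix("")
    with gzip.open(staged, "rb") as src, open(tar_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Equivalent of <untar src="..." dest="${build.dir}"/>
    with tarfile.open(tar_path, "r:") as tar:
        tar.extractall(build_dir)

    return tar_path
```

Called with the archive under dict.loc.dir and the Lucene build directory, this would leave the unpacked dictionary sources in a subdirectory of the build directory, provided the archive unpacks into a top-level directory as the dict.src.dir property expects.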



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

