lucene-dev mailing list archives

From Jan Høydahl (JIRA) <j...@apache.org>
Subject [jira] [Commented] (LUCENE-6736) SmartChineseAnalyzer chops English tokens in a chinese-english mixed sentence.
Date Fri, 14 Aug 2015 06:39:45 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696573#comment-14696573 ]

Jan Høydahl commented on LUCENE-6736:
-------------------------------------

What you are seeing is the effect of stemming, which is expected: SmartChineseAnalyzer applies
Porter stemming to English tokens, so "position" becomes "posit". Please raise questions on
the users list before opening a bug ticket.
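
To make the stemming effect concrete, here is a small sketch in plain Java (no Lucene dependency) that pairs the English words from the reported input with the English tokens from the reported output and lists the ones that were shortened. The class name StemCheck and the method shortened are hypothetical, chosen only for this illustration. The word lists are copied by hand from the report quoted below, lowercased as in the reported output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StemCheck {
    // English words from the reported input string, lowercased, in order of appearance.
    static final List<String> INPUT_WORDS = Arrays.asList(
        "second", "seed", "first", "seed", "korean", "player",
        "japanese", "player", "secure", "position", "congratulations");

    // English tokens from the reported tokenizer output, in the same order.
    static final List<String> OUTPUT_TOKENS = Arrays.asList(
        "second", "seed", "first", "seed", "korean", "player",
        "japanes", "player", "secur", "posit", "congratul");

    // Returns "word -> token" for every token that is a strict prefix of its word,
    // i.e. every word that lost a suffix during analysis.
    static List<String> shortened(List<String> words, List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String w = words.get(i);
            String t = tokens.get(i);
            if (!w.equals(t) && w.startsWith(t)) {
                out.add(w + " -> " + t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : shortened(INPUT_WORDS, OUTPUT_TOKENS)) {
            System.out.println(line);
        }
    }
}
```

Only four words change, and each one loses a suffix ("e", "e", "ion", "ations") rather than being cut at an arbitrary position. That pattern is what suffix-stripping stemming produces and is not a sign of a tokenization fault.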

> SmartChineseAnalyzer chops English tokens in a chinese-english mixed sentence.
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-6736
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6736
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 5.1
>         Environment: linux Java 1.7
>            Reporter: Wayne Xin
>              Labels: chinese, tokenization
>
> I am new to Lucene analyzers. The following code defines the sentence in "testStr":
> 		String testStr = "女单方面,王适娴second seed和头号种子卫冕冠军西班牙选手马林first seed同处1/4区,3号种子李雪芮和韩国选手Korean player成池铉处在2/4区,不过成池铉先要过日本小将(Japanese player)奥原希望这关。下半区,6号种子王仪涵若想晋级决赛secure position. congratulations.";
> The printed tokenized result is:
>  女 单 方面 王 适 娴 second seed 和 头号 种子 卫冕 冠军 西班牙 选手 马 林 first seed 同 处 1 4 区 3 号 种子 李 雪 芮 和 韩国 选手 korean player 成 池 铉 处在 2 4 区 不过 成 池 铉 先 要 过 日本 小将 japanes player 奥 原 希望 这 关 下 半 区 6 号 种子 王 仪 涵 若 想 晋级 决赛 secur posit congratul
> As you can see, some English tokens such as "Japanese", "secure", "position" and "congratulations"
> are cut short in the tokenization process. I hope I am not using it incorrectly.
> Test code:
> 	import java.io.IOException;
> 	import java.util.ArrayList;
> 	import java.util.List;
> 	import org.apache.lucene.analysis.Analyzer;
> 	import org.apache.lucene.analysis.TokenStream;
> 	import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
> 	import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>
> 	private static void testChineseTokenizer() throws IOException {
> 		String testStr = "女单方面,王适娴second seed和头号种子卫冕冠军西班牙选手马林first seed同处1/4区,3号种子李雪芮和韩国选手Korean player成池铉处在2/4区,不过成池铉先要过日本小将(Japanese player)奥原希望这关。下半区,6号种子王仪涵若想晋级决赛secure position. congratulations.";
> 		List<String> result = new ArrayList<String>();
> 		// try-with-resources closes the stream first, then the analyzer.
> 		try (Analyzer analyzer = new SmartChineseAnalyzer();
> 		     // The field name is arbitrary for this test.
> 		     TokenStream stream = analyzer.tokenStream("field", testStr)) {
> 			CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
> 			stream.reset(); // required before the first incrementToken()
> 			while (stream.incrementToken()) {
> 				result.add(cattr.toString());
> 			}
> 			stream.end(); // required after the last token
> 		}
> 		// Print all collected tokens on one line, separated by spaces.
> 		for (String tok : result) {
> 			System.out.print(" " + tok);
> 		}
> 		System.out.println();
> 	}



