lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (LUCENE-1728) Move SmartChineseAnalyzer & resources to own contrib project
Date Tue, 21 Jul 2009 09:20:14 GMT


Robert Muir commented on LUCENE-1728:

Simon, I agree with you: there is a ton of work to be done.

I also did not particularly like my method of moving everything into one package to hide the
internals... and I 100% agree that a "correct" refactoring is quite a bit of work. 

I don't want to sound like a complainer, since I don't have a patch to fix these things, but
here are some things that I would also like to fix/refactor:
* removal of the GB2312 dictionary dependency: this limits functionality to Simplified Chinese.
* use of Unicode categories (Java Character class, etc.) instead of Utility.getCharType()
* support for code points outside of the BMP; this is necessary to support Traditional Chinese.
* a little more flexibility with tokenization. Honestly, I'm really not sold on indexing "words"
for Chinese in the first place, but words + bigrams (overlapping tokens) would be nice.
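
To make the last three points a bit more concrete, here is a minimal sketch in plain Java. It
is not Lucene code and not a proposed patch: the class HanBigramSketch and its methods isHan
and bigrams are hypothetical names made up for illustration. It assumes a JDK with
Character.UnicodeScript (Java 7 or later), walks the text by code point rather than by char so
that characters outside the BMP stay intact, classifies each code point with the JDK's own
Unicode data, and emits overlapping bigrams for each run of Han characters. It only covers the
bigram half of "words + bigrams".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Lucene.
public class HanBigramSketch {

    // Classify by Unicode script using the JDK's tables instead of a
    // hand-written character-type lookup.
    static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    // Emit overlapping bigrams for every run of consecutive Han code points.
    // A run of length one is emitted as a single-character token.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        List<Integer> run = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            // codePointAt joins surrogate pairs, so code points outside
            // the BMP are handled as one unit.
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isHan(cp)) {
                run.add(cp);
            } else {
                flush(run, tokens);
            }
        }
        flush(run, tokens);
        return tokens;
    }

    // Turn the buffered run into tokens and reset the buffer.
    private static void flush(List<Integer> run, List<String> tokens) {
        if (run.size() == 1) {
            tokens.add(new String(Character.toChars(run.get(0))));
        }
        for (int k = 0; k + 1 < run.size(); k++) {
            tokens.add(new StringBuilder()
                    .appendCodePoint(run.get(k))
                    .appendCodePoint(run.get(k + 1))
                    .toString());
        }
        run.clear();
    }

    public static void main(String[] args) {
        // Five BMP Han characters yield four overlapping bigrams.
        System.out.println(bigrams("\u6211\u662F\u4E2D\u56FD\u4EBA"));
        // U+20000 (outside the BMP, written as a surrogate pair) followed by U+4E2D.
        System.out.println(bigrams("\uD840\uDC00\u4E2D").size());
    }
}
```

A real analyzer would have to do this inside a Tokenizer/TokenFilter and track offsets; the
sketch only shows the code point handling and the overlapping-token idea.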

In the future it would be nice to add support for Traditional Chinese, and there is frequency
data out there (libtabe: BSD license, etc.), but we need to refactor first.

As far as what to do for 2.9... I really don't know either, just let me know if you need a
new patch :)

> Move SmartChineseAnalyzer & resources to own contrib project
> ------------------------------------------------------------
>                 Key: LUCENE-1728
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: 2.9
>         Attachments: LUCENE-1728.txt, LUCENE-1728.txt, LUCENE-1728.txt
> SmartChineseAnalyzer depends on a large dictionary that causes the analyzer jar to grow
> to as much as 3MB. The dictionary is quite big compared to all the other resources / class
> files contained in that jar.
> Having a separate analyzer-cn contrib project enables footprint-sensitive users (e.g.
> using lucene on a mobile phone) to include analyzer.jar without getting into trouble with
> disk space.
> Moving SmartChineseAnalyzer to a separate project could also include a small refactoring.
> As Robert mentioned in [LUCENE-1722|], several classes should be package protected,
> members and classes could be final, and commented-out syserr and logging code should be
> removed, etc.
> I set this issue's target to 2.9. If we cannot make it by then, feel free to move it
> to 3.0.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
