lucene-dev mailing list archives

From "Che Dong" <>
Subject Re: Analyzers for various languages
Date Tue, 31 Dec 2002 06:59:19 GMT
For Asian languages (Chinese, Korean, Japanese), bigram-based word segmentation is an easy way
to solve the word segmentation problem.
Bigram-based word segmentation is:  C1C2C3C4  =>  C1C2 C2C3 C3C4  (C# is a single CJK character).
I think the way forward is to make a StandardTokenizer that can handle mixed-language content:
Chinese/English, Japanese/French mixed content.
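
As a rough illustration of the bigram rule above, here is a minimal sketch in plain Java. It is not the CJKTokenizer source; the class name BigramSketch and the method bigrams are made up for this example, and it assumes every character in the run fits in a single Java char (no surrogate pairs). Keeping a lone character as its own term is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
public class BigramSketch {

    // Turns a run of CJK characters C1C2C3C4 into the overlapping
    // terms C1C2, C2C3, C3C4. Assumes one Java char per character.
    static List<String> bigrams(String run) {
        List<String> terms = new ArrayList<>();
        if (run.length() == 1) {
            // Assumption: a lone character has no neighbour to pair with,
            // so it is kept as a single-character term.
            terms.add(run);
            return terms;
        }
        for (int i = 0; i + 1 < run.length(); i++) {
            terms.add(run.substring(i, i + 2));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Four Han characters written as Unicode escapes; this prints
        // three overlapping two-character terms.
        System.out.println(bigrams("\u4e2d\u6587\u5206\u8bcd"));
    }
}
```

A run of n characters (n > 1) yields n - 1 terms, so the index grows somewhat, but no dictionary is needed.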

In CJKTokenizer (modified from StopTokenizer) I use a one-character buffer to remember the
previous CJK character and make overlapping terms (Ci-1 + Ci).
But in StandardTokenizer I still don't know how to make:
T1T2T3T4 => T1T2 T2T3 T3T4  (T# is a single-CJK-character term).
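
The one-character-buffer idea can be sketched for mixed content roughly as follows. This is only an illustration in plain Java, not the real CJKTokenizer or StandardTokenizer code: the names MixedBigramSketch, isCjk and tokenize are hypothetical, the script test (Han, Hiragana, Katakana, Hangul) is an assumption about what counts as CJK, and characters outside the Basic Multilingual Plane are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
public class MixedBigramSketch {

    // Assumption: these four scripts are treated as "CJK" here.
    static boolean isCjk(char c) {
        Character.UnicodeScript s = Character.UnicodeScript.of(c);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA
            || s == Character.UnicodeScript.HANGUL;
    }

    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        StringBuilder word = new StringBuilder(); // pending non-CJK word
        char prev = 0;           // the one-character buffer (previous CJK char)
        boolean hasPrev = false; // true while inside a CJK run
        boolean paired = false;  // true once prev has appeared in a bigram

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isCjk(c)) {
                // A CJK character ends any pending non-CJK word.
                if (word.length() > 0) { terms.add(word.toString()); word.setLength(0); }
                if (hasPrev) { terms.add("" + prev + c); paired = true; }
                prev = c;
                hasPrev = true;
            } else {
                // A non-CJK character ends the CJK run; a run of exactly
                // one character is emitted as a single-character term.
                if (hasPrev && !paired) terms.add(String.valueOf(prev));
                hasPrev = false;
                paired = false;
                if (Character.isLetterOrDigit(c)) {
                    word.append(c);
                } else if (word.length() > 0) {
                    terms.add(word.toString());
                    word.setLength(0);
                }
            }
        }
        // Flush whatever is still buffered at end of input.
        if (hasPrev && !paired) terms.add(String.valueOf(prev));
        if (word.length() > 0) terms.add(word.toString());
        return terms;
    }

    public static void main(String[] args) {
        // "ab", three Han characters, a space, then "cd".
        System.out.println(tokenize("ab\u4e2d\u6587\u5206 cd"));
    }
}
```

The same buffer would apply to the T1T2T3T4 case if the upstream tokenizer emits each CJK character as its own term: remember the previous single-character term and emit the pair when the next one arrives.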

For more articles on word segmentation for Asian languages:


Che, Dong
----- Original Message ----- 
From: "Eric Isakson" <>
To: <>
Sent: Saturday, December 07, 2002 12:40 AM
Subject: Analyzers for various languages

> Hi All, 
> I want to volunteer to help get language modules organized into the CVS and builds.
> I've been lurking on the lists here for a couple months and working with and getting
familiar with Lucene. I'm investigating the use of Lucene to support our help system's full-text
search requirements. I have to build indices for multiple languages. I just poked around the
CVS archives and found only the German, Russian and standard (English) analyzers in the core
and nothing in the sandbox. In the list archives I've found many references to folks using
Lucene for several other languages. I did find the CJKTokenizer, Dutch and French analyzers
and have put those into my tests. Is there somewhere these analyzers are organized that I
might get a hold of the sources for other languages to build into my toolset? There were a
couple mentioned that several of you appear to be using that I can't find the sources for
(most notably <>
 which gives a "Cannot find server" error). 
> In order to meet the requirements for my product these are the languages I have to support:

> Must Support 
> ------------ 
> English
> Japanese 
> Chinese 
> Korean 
> French 
> German 
> Italian 
> Polish 
> 
> Not Sure Yet 
> ------------ 
> Czech 
> Danish 
> Hebrew 
> Hungarian 
> Russian 
> Spanish 
> Swedish 
> I understand the issues that were raised about putting language modules in the core and
then not being able to support them, but it seems they have not been put anywhere. I would
be willing to try and get them into a central place that people can access them or help someone
that is already working on that. I can't commit today to being able to maintain or bugfix
contributions, but should my company adopt Lucene as our search engine (which seems likely
at this point) I'll do what I can to contribute back any fixes we make. I also have a personal
interest in the project since I've found Lucene quite interesting to be working with and I've
enjoyed learning about internationalizing java apps.
> I'll volunteer to help gather and organize these somewhere if I were given committer
rights to the appropriate area and folks would be willing to send me their language modules.

> I recall some discussion about moving language modules out of the core, but I don't think
any decisions were made about where to put them (perhaps this is why they aren't in the CVS
at all). I was thinking perhaps give each language a sandbox project or create language packages
in the core build that could be enabled via settings in the file. Using the file could allow us to create a jar for each language during the core build
so folks could install just the language modules they want and if a language module starts
breaking due to changes in the core it could easily be turned off until fixes were made to
that module. I can start working on a setup like this in my local source tree next week using
the existing language modules in the core if you all think this would be a good approach.
If not, does anyone have a proposal for where these belong so we can get some movement on
getting them committed to CVS?
> Regards,
> Eric
> -- 
> Eric D. Isakson        SAS Institute Inc. 
> Application Developer  SAS Campus Drive 
> XML Technologies       Cary, NC 27513 
> (919) 531-3639 <>  