lucene-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Analyzers, perfect hash, ICU
Date Wed, 11 Jan 2006 18:58:17 GMT
>Hi all,
>    I'm working on an analyzer for the Slavic languages written in 
>Latin script (cs, sk), without stemming at first.
>I would like to ask you:
>The stop-word analyzers often use a HashSet implementation, but the 
>stop words rarely (if ever) change from those shipped in the Java 
>code. Do you think there is any benefit in using a perfect hash 
>algorithm?

My guess is that you wouldn't save much time here using a perfect hash.
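
For reference, the lookup in question boils down to roughly one hash 
computation and one set probe per token. A minimal sketch, where the 
class name and the short word list are made up for illustration and 
are not taken from any Lucene analyzer:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class StopWords {
    // Hypothetical sample list; a real analyzer would ship a longer one.
    static final Set<String> STOP_WORDS =
        new HashSet<String>(Arrays.asList("a", "i", "na", "se", "v", "ze"));

    // One HashSet probe per token: hash the term, then compare on a hit.
    static boolean isStopWord(String term) {
        return STOP_WORDS.contains(term);
    }
}
```

A perfect hash would only remove the collision handling from that 
probe, which is why I'd expect the gain to be small next to the rest 
of the analysis work.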

>I will write an ICU-based analyzer for Latin characters (decomposing 
>them and returning the base character). Do you have any experience 
>with ICU (icu.sf.net), such as problems or bottlenecks?

Using ICU is a good idea, but calling it for every token could be a 
significant performance hit. Typically, putting a simple filter in 
front of it can save you a lot of time.

E.g. if there are a lot of characters that don't require any 
decomposition, you could do some quick (and very conservative) checks 
to skip calls to ICU.
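
A rough sketch of that kind of check, assuming pure-ASCII terms are 
the common case. Here java.text.Normalizer from the JDK stands in for 
the ICU call, and the AccentFolder class and its method names are 
made up for the example:

```java
import java.text.Normalizer;

final class AccentFolder {
    // Very conservative test: pure ASCII text never needs decomposition.
    static boolean isAscii(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) >= 0x80) {
                return false;
            }
        }
        return true;
    }

    // Returns the term with accents removed, e.g. base characters only.
    static String fold(String term) {
        if (isAscii(term)) {
            return term; // fast path: skip the normalizer entirely
        }
        // Slow path: canonical decomposition (stand-in for the ICU call),
        // then drop the combining marks that the decomposition produced.
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (Character.getType(c) != Character.NON_SPACING_MARK) {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

The ASCII test errs on the side of calling the normalizer too often 
rather than too rarely, so it can't change the result, only the cost.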

But of course, measure then optimize :)

>P.S.: I would also like to contribute this work to lucene-contrib if 
>it is considered useful. Is there any howto for setting up Eclipse 
>for Lucene/Apache-related projects?

If you're asking about how to set up Eclipse to do development for 
Lucene, I found some posts to the mailing list a while back, but 
nothing definitive.

FWIW, my experience w/Eclipse 3.1 was that trying to auto-create 
Eclipse projects using the Ant build file didn't work very well. So 
we wound up manually creating the project, setting up the classpath, 
etc.

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-470-9200

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

