lucene-solr-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: UnicodeNormalizationFilterFactory
Date Mon, 04 Aug 2008 18:34:27 GMT
Robert, does your code do something that ICU doesn't do?  See http://www.icu-project.org/


Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Robert Haschart <rh9ec@virginia.edu>
> To: solr-user@lucene.apache.org
> Sent: Thursday, June 26, 2008 4:41:02 PM
> Subject: Re: UnicodeNormalizationFilterFactory
> 
> Lance Norskog wrote:
> 
> >ISOLatin1AccentFilterFactory works quite well for us. It solves our basic
> >euro-text keyboard searching problem, where "protege" should find protégé.
> >("protege" with two accents.)
> >
> >-----Original Message-----
> >From: Chris Hostetter [mailto:hossman_lucene@fucit.org]
> >Sent: Tuesday, June 24, 2008 4:05 PM
> >To: solr-user@lucene.apache.org
> >Subject: Re: UnicodeNormalizationFilterFactory
> >
> >
> >: I've seen mention of these filters:
> >:
> >:  
> >:  
> >
> >Are you asking because you saw these in Robert Haschart's reply to your
> >previous question?  I think those are custom Filters that he has in his
> >project ... not open source (but i may be wrong)
> >
> >they are certainly not something that comes out of the box w/ Solr.
> >
> >
> >-Hoss
> >  
> >
> The ISOLatin1AccentFilter works well in the case described above by 
> Lance Norskog, i.e. for words in which each accented letter is a single 
> precomposed Unicode character, as in protégé. However, in the data that 
> we work with, accented characters are often represented by a plain 
> unaccented character followed by the Unicode combining character for the 
> accent mark, roughly like this: prote'ge'. Such words emerge from the 
> ISOLatin1AccentFilter unchanged.
> 
> After some research I found the UnicodeNormalizationFilter mentioned 
> above, which did not work on my development system (because it relies 
> on features only available in Java 6), and which, when combined with the 
> DiacriticsFilter also mentioned above, would remove diacritics from 
> characters but would also discard any Chinese or Russian characters, or 
> anything else outside the 0x00-0x7f range. Which is bad.
> 
> I first modified the filter to normalize the characters to the composed 
> normalized form (changing prote'ge' to protégé) and then pass the 
> results through the ISOLatin1AccentFilter. However, for accented 
> characters for which there is no composed normalized form (such as the n 
> and s in Zarin̦š), the accents are not removed.
> 
> So I took the approach of decomposing the accented characters and then 
> removing only the valid diacritics and zero-width combining characters 
> from the result, and the resulting filter works quite well. And since it 
> was developed as part of the Blacklight project at the University of 
> Virginia, it is Open Source under the Apache License.
> 
> If anyone is interested in evaluating or using the 
> UnicodeNormalizationFilter in conjunction with their Solr installation, 
> get the UnicodeNormalizeFilter.jar from:
> 
> http://blacklight.rubyforge.org/svn/trunk/solr/lib/
> 
> and place it in a lib directory next to the conf directory in your Solr 
> home directory.
> 
> Robert Haschart
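
For readers following the thread, the decompose-then-strip idea described in the quoted message can be sketched roughly as below. This is not the Blacklight filter itself: the class name DiacriticStripSketch and the method stripDiacritics are made up for illustration, and the sketch assumes that "diacritics" can be approximated by the Unicode general category Mn (nonspacing marks). It uses java.text.Normalizer, which is available from Java 6 onward.

```java
import java.text.Normalizer;

// Hypothetical illustration only; not the filter from the Blacklight project.
class DiacriticStripSketch {

    // Decompose to canonical form NFD, then delete nonspacing combining
    // marks (general category Mn). Base letters outside ASCII, such as
    // Cyrillic or CJK characters, are kept.
    public static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}", "");
    }

    public static void main(String[] args) {
        // Same visible word, two encodings: precomposed e-acute versus
        // plain e followed by the combining acute accent.
        String composed = "prot\u00e9g\u00e9";
        String decomposed = "prote\u0301ge\u0301";

        System.out.println(composed.equals(decomposed));   // false
        System.out.println(stripDiacritics(composed));     // protege
        System.out.println(stripDiacritics(decomposed));   // protege

        // n followed by combining comma below, then s with caron.
        System.out.println(stripDiacritics("Zarin\u0326\u0161")); // Zarins

        // Cyrillic and Chinese text without decomposable letters is unchanged.
        System.out.println(stripDiacritics("\u041c\u043e\u0441\u043a\u0432\u0430"));
        System.out.println(stripDiacritics("\u4e2d\u6587"));
    }
}
```

One caveat with stripping every Mn character: letters whose canonical decomposition includes a mark, such as the Cyrillic short i (U+0439), lose that mark as well, so a production filter would probably want to restrict removal to a chosen set of marks or scripts.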

