nutch-dev mailing list archives

From Doug Cutting <cutt...@nutch.org>
Subject Re: Multi-Lingual support
Date Mon, 13 Jun 2005 16:35:32 GMT
Jérôme Charron wrote:
> I was thinking about it for a while: Multi-Lingual support in Nutch.
> After looking at the Nutch code, I wrote a proposal on the Wiki (
> http://wiki.apache.org/nutch/MultiLingualSupport).

+1

I think this is a good design.

This was discussed recently on the Lucene java-user mailing list:

http://www.mail-archive.com/java-user@lucene.apache.org/msg01197.html

False positives are sometimes a concern, where different words in
different languages analyze to the same string.  For example, a French
analyzer might normalize "thé" to "the", which would then match many
English documents.  I think such issues are best resolved by restricting
search results to the desired language.  (In the case of "thé", even
this is not enough, since many French documents include bits of English.
So, while converting "thé" to "the" might aid French recall a bit, it
really kills precision.  Thus "thé" is probably best analyzed as "thé",
or perhaps as both "thé" and "the".)
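
To make the precision problem concrete, here is a minimal sketch in plain Python.  It is not Nutch or Lucene code: the three toy documents, the strip_accents and search helpers, and the whitespace tokenizer are all assumptions made up for illustration.  It compares an accent-folding analysis with an accent-preserving one, with and without a language filter.

```python
import unicodedata

# Hypothetical toy corpus: doc id -> (language, text). Not real data.
DOCS = {
    "fr-1": ("fr", "le thé vert"),
    "fr-2": ("fr", "lire the guardian à paris"),  # French page quoting English
    "en-1": ("en", "the green tea"),
}


def strip_accents(token):
    """Drop combining marks, so that 'thé' becomes 'the'."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text, fold):
    """Lowercase, split on whitespace, and optionally fold accents."""
    terms = []
    for token in text.lower().split():
        token = unicodedata.normalize("NFC", token)
        terms.append(strip_accents(token) if fold else token)
    return terms


def search(query, fold, lang=None):
    """Return sorted ids of documents containing the analyzed query term."""
    (term,) = analyze(query, fold)  # single-word queries only in this sketch
    hits = []
    for doc_id, (doc_lang, text) in DOCS.items():
        if lang is not None and doc_lang != lang:
            continue  # restrict results to the requested language
        if term in analyze(text, fold):
            hits.append(doc_id)
    return sorted(hits)


if __name__ == "__main__":
    print(search("thé", fold=True))             # folding, no language filter
    print(search("thé", fold=True, lang="fr"))  # folding, French only
    print(search("thé", fold=False))            # accents preserved
```

With folding, the query matches all three documents.  Restricting to French removes the English page but still returns the French page that merely quotes "the", and only the accent-preserving analysis returns the tea document alone.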

Doug
