nutch-dev mailing list archives

From Andrzej Bialecki <...@getopt.org>
Subject Re: LanguageIdentifier refactoring
Date Tue, 05 Jul 2005 17:33:43 GMT
Jérôme Charron wrote:

> I think this is an issue for all detection mechanisms...
> For the content-type it is the same problem: what is the right value, the 
> one provided by the protocol layer, the one provided by the extension 
> mapping, or the one provided by detection (mime-magic)?
> 
> I think we need to change the current process to use auto-detection 
> mechanisms (this is true at least for the code that uses the 
> language-identifier and the code that uses the mime-type identifier). 
> Instead of doing something like:
> 
> 1. Get info from the protocol
> 2. If no info from the protocol, get info from parsing
> 3. If no info from parsing, get info from auto-detection
> 
> We need to do something like:
> 
> 1. Get info from the protocol
> 2. Get info from parsing
> 3. Get degrees of confidence from auto-detection, and check:
> 3.1 If the value extracted from the protocol has a high degree of 
> confidence, take the protocol value.
> 3.2 If the value extracted from parsing has a high degree of confidence, 
> take the parsing value.
> 3.3 If neither has a high degree of confidence, but auto-detection 
> returns another value with a high degree of confidence, take the 
> auto-detection value.
> 3.4 Otherwise, take a default value.

Yes, I agree.
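
For concreteness, here is a minimal Java sketch of what step 3 could look 
like. The Detector interface, the resolve() signature, and the 0.8 
threshold are illustrative assumptions, not existing plugin APIs:

import java.util.Map;

public class LanguageResolver {

  /** Hypothetical detector contract: lang codes mapped to 0..1 scores. */
  public interface Detector {
    Map<String, Float> identify(String text);
  }

  // Assumed cutoff for a "high degree of confidence".
  private static final float HIGH_CONFIDENCE = 0.8f;

  public String resolve(String protocolLang, String parseLang,
                        Detector detector, String text, String defaultLang) {
    Map<String, Float> scores = detector.identify(text);

    // 3.1 Protocol value confirmed by the detector with high confidence.
    if (isConfident(scores, protocolLang)) return protocolLang;

    // 3.2 Parsing value confirmed by the detector with high confidence.
    if (isConfident(scores, parseLang)) return parseLang;

    // 3.3 Neither hint is confirmed, but the detector itself is highly
    //     confident about some other language.
    for (Map.Entry<String, Float> e : scores.entrySet()) {
      if (e.getValue() >= HIGH_CONFIDENCE) return e.getKey();
    }

    // 3.4 Fall back to a default value.
    return defaultLang;
  }

  private static boolean isConfident(Map<String, Float> scores, String lang) {
    return lang != null && scores.getOrDefault(lang, 0f) >= HIGH_CONFIDENCE;
  }
}

The same shape would work for content-type resolution, with mime-magic 
scores in place of the n-gram scores.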

>>* modify the identify() method to return a pair of lang code + relative
>>score (normalized to 0..1)
> 
> 
> What do you think about returning a sorted array of lang/score pairs?

Yes, that would make sense too. I've been working with a proprietary 
language detection tool (based on similar principles), and it returned 
a sorted array as well.
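
For illustration, such a pair type could be as simple as this 
(hypothetical names, not the current identify() signature):

import java.util.Arrays;

// Hypothetical lang/score pair returned by the detector.
public class LangScore {
  public final String lang;
  public final float score;   // normalized to 0..1

  public LangScore(String lang, float score) {
    this.lang = lang;
    this.score = score;
  }

  /** Sorts results in place so the most likely language comes first. */
  public static void sortByScore(LangScore[] results) {
    Arrays.sort(results, (a, b) -> Float.compare(b.score, a.score));
  }
}

Returning the whole sorted array keeps the single-best-guess case trivial 
(element 0) while still exposing the scores needed for the confidence 
checks above.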

> For information, there are some other issues with the language-identifier: 
> I was focused on performance and precision, and now that I run it outside 
> the "lab" and perform some tests in real life, with real documents, I 
> see a very big issue: the LanguageIdentifierPlugin is UTF-8 oriented!!!
> I discovered this issue and analyzed it yesterday: with UTF-8 encoded 
> input documents you get very good identification, but with any other 
> encoding it is a disaster.

Mhm. I'm not so sure. The NGramProfile load/save methods are safe: they 
both use UTF-8. LanguageIdentifier.identify() seems to be safe, too, 
because it only works with Strings, which are not encoded (they are 
native Unicode). So the only problematic place seems to be the 
command-line utilities (the main methods in both classes), where a 
simple change to use InputStreamReader(inputStream, encoding) would fix 
the issue...
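
As a sketch, the command-line path could look like this (the argument 
handling is illustrative; the point is the explicit charset passed to 
InputStreamReader instead of relying on the platform default):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class IdentifyFile {
  public static void main(String[] args) throws Exception {
    String file = args[0];
    // Illustrative: take the charset as an optional second argument.
    String encoding = args.length > 1 ? args[1] : "UTF-8";

    StringBuilder text = new StringBuilder();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), encoding))) {
      char[] buf = new char[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        text.append(buf, 0, n);   // decoded to native Unicode here
      }
    }
    // The resulting String is encoding-safe input for identify(), e.g.:
    // String lang = identifier.identify(text.toString());
  }
}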

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

