nutch-dev mailing list archives

From Jérôme Charron <>
Subject Re: LanguageIdentifier refactoring
Date Thu, 07 Jul 2005 13:38:11 GMT
> Mhm. I'm not so sure. The NGramProfile load/save methods are safe, they
> both use UTF-8. LanguageIdentifier.identify() seems to be safe, too -
> because it only works with Strings, which are not encoded (native
> Unicode). So, the only place where it would be problematic seems to be
> in the command-line utilities (main methods in both classes), where a
> simple change to use InputStreamReader(inputstream, encoding) would fix
> the issue...
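
For reference, the change suggested above could look roughly like the sketch below. The class and method names (EncodingAwareReader, readAll) are hypothetical and are not taken from the Nutch code; only InputStreamReader(InputStream, String) is the standard JDK constructor mentioned in the quote.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical helper, not part of Nutch: reads a whole stream as text
// using an explicitly named encoding instead of the platform default.
public class EncodingAwareReader {

    public static String readAll(InputStream in, String encoding) throws IOException {
        StringBuilder sb = new StringBuilder();
        // InputStreamReader(InputStream, String) decodes bytes with the given charset
        // and throws UnsupportedEncodingException if the name is unknown.
        try (Reader reader = new BufferedReader(new InputStreamReader(in, encoding))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Encode a sample string as ISO-8859-1, then decode it with the same charset.
        byte[] latin1 = "J\u00e9r\u00f4me".getBytes("ISO-8859-1");
        String text = readAll(new ByteArrayInputStream(latin1), "ISO-8859-1");
        System.out.println(text.length());
    }
}
```

In a command-line main method, the encoding would presumably come from an extra argument, with a fixed default when none is given.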

In fact, what I see when looking at the code (correct me if I'm wrong) is 
that the Writers and Readers used by Nutch don't take the encoding into 
account (only the HtmlParser performs some encoding detection and adds some 
metadata about the encoding).
So, my idea is simply to:
1. Move the encoding detection used in HtmlParser to a more generic place 
(ParseSegment could be a good candidate)
2. Use the encoding metadata in all the read/write related methods
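
To make point 2 a bit more concrete, here is a minimal sketch of what "use the encoding metadata" could mean in practice. Everything in it is an assumption for illustration: the MetadataCharset class, the "encoding" key and the plain Map standing in for the metadata are hypothetical, not the existing Nutch API. The fallback to UTF-8 is also just one possible choice.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: picks the charset recorded in the
// metadata and uses it for both decoding and encoding of content.
public class MetadataCharset {

    // Assumed metadata key; the real key name used by the parser may differ.
    public static final String ENCODING_KEY = "encoding";

    // Returns the charset named in the metadata, or UTF-8 when the entry
    // is missing, malformed, or not supported by this JVM.
    public static Charset resolve(Map<String, String> metadata) {
        String name = metadata.get(ENCODING_KEY);
        if (name != null) {
            try {
                return Charset.forName(name.trim());
            } catch (IllegalArgumentException e) {
                // Covers both illegal and unsupported charset names; fall through.
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Read side: bytes to String using the detected encoding.
    public static String decode(byte[] content, Map<String, String> metadata) {
        return new String(content, resolve(metadata));
    }

    // Write side: String to bytes using the same encoding.
    public static byte[] encode(String text, Map<String, String> metadata) {
        return text.getBytes(resolve(metadata));
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(ENCODING_KEY, "ISO-8859-1");
        byte[] raw = encode("\u00e9", metadata);
        System.out.println(raw.length + " byte(s), round trip ok: "
                + decode(raw, metadata).equals("\u00e9"));
    }
}
```

The point of routing both directions through a single resolve method is that the detection result is recorded once, and every reader and writer then agrees on it.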

This seems like a lot of work... but I think it is necessary... no?


