nutch-dev mailing list archives

From "Stefan Groschupf (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-144) corrupt language identifier tri files and bad language recognition for german
Date Sat, 17 Dec 2005 17:01:34 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-144?page=comments#action_12360668 ] 

Stefan Groschupf commented on NUTCH-144:
----------------------------------------

A good source for such documents is:
http://www.gutenberg.org/catalog/


> corrupt language identifier tri files and bad language recognition for german
> -----------------------------------------------------------------------------
>
>          Key: NUTCH-144
>          URL: http://issues.apache.org/jira/browse/NUTCH-144
>      Project: Nutch
>         Type: Improvement
>     Versions: 0.8-dev
>     Reporter: Bernhard Messer
>     Priority: Minor

>
> Hi,
> I had a look at the generated language guesser tri files. As far as I can tell, several
> of them (de.ngp, da.ngp, es.ngp) seem to be corrupt, which leads to a poor language
> recognition rate. For example, the German tri file should contain the German special
> characters "ä", "ö", "ü" with their frequencies. The text "grüne Hüte", which is typical
> German, is recognized as Danish. Maybe the problem comes from a wrong character encoding
> during training.
> Jerome, could you provide the training files so that the language identifier can be
> retrained?
> Regards,
>  Bernhard
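
To illustrate the encoding hypothesis raised in the report, here is a minimal sketch of how decoding training text with the wrong charset could remove the umlauts from a trigram profile. It is not Nutch's code: the `trigrams` helper, the `_` word padding and the whitespace splitting are assumptions made for the example, and the real .ngp files may be built differently.

```python
from collections import Counter


def trigrams(text):
    """Count character trigrams per word, padding each word with '_'.

    Hypothetical helper for illustration only; not Nutch's profile builder.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts


sample = "grüne Hüte"
raw = sample.encode("utf-8")  # training text stored as UTF-8 bytes

# Correct decoding keeps the umlauts in the profile.
good = trigrams(raw.decode("utf-8"))

# Decoding the same bytes as ISO-8859-1 turns each "ü" into two
# unrelated characters, so no trigram contains "ü" any more.
bad = trigrams(raw.decode("iso-8859-1"))

has_umlaut_good = any("ü" in gram for gram in good)
has_umlaut_bad = any("ü" in gram for gram in bad)

print("correct decoding has ü trigrams:", has_umlaut_good)
print("wrong decoding has ü trigrams:  ", has_umlaut_bad)
```

If the German profile was trained this way, trigrams such as "grü" would be missing from de.ngp, which would be consistent with the symptom described above. Whether this is what actually happened can only be confirmed against the original training files.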

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

