nutch-dev mailing list archives

From "Antony Bowesman (JIRA)" <j...@apache.org>
Subject [jira] Created: (NUTCH-632) Bug in TextParser with encoding
Date Tue, 20 May 2008 02:58:55 GMT
Bug in TextParser with encoding
-------------------------------

                 Key: NUTCH-632
                 URL: https://issues.apache.org/jira/browse/NUTCH-632
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 0.9.0
         Environment: Any
            Reporter: Antony Bowesman


If a Content object is created with the following Content-Type: text/plain; charset="windows-1251"

the Content object discards the charset parameter.  As a result, when the TextParser calls

String encoding = StringUtil.parseCharacterEncoding(content.getContentType());

it always gets null, because the contentType stored in the Content object no longer contains
the charset parameter.  The code has changed a lot since 0.9, so I am not sure whether this is
still a problem, but I made a fix that simply saves the charset in Content with

    if (this.contentType.startsWith("text/"))
        this.charset = StringUtil.parseCharacterEncoding(contentType);

and TextParser just calls

    String encoding = content.getCharset();
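
For illustration, here is a minimal, self-contained sketch of the idea behind the fix. It is not the actual Nutch code: ContentSketch and its parseCharset helper are hypothetical stand-ins for Content and StringUtil.parseCharacterEncoding, and the parsing is deliberately simplified. The point it shows is that the charset has to be captured before the parameters are stripped from the Content-Type value.

```java
import java.util.Locale;

// Hypothetical stand-in for the Nutch Content class (assumption, not the real code).
final class ContentSketch {
    private final String contentType; // bare media type, parameters stripped
    private final String charset;     // charset parameter, or null if none was kept

    ContentSketch(String contentTypeHeader) {
        // Capture the charset first, while the parameters are still present.
        String parsed = parseCharset(contentTypeHeader);
        int semi = contentTypeHeader.indexOf(';');
        String bare = semi < 0 ? contentTypeHeader : contentTypeHeader.substring(0, semi);
        this.contentType = bare.trim().toLowerCase(Locale.ROOT);
        // Mirror the fix in the report: only keep the charset for text/* types.
        this.charset = this.contentType.startsWith("text/") ? parsed : null;
    }

    String getContentType() { return contentType; }

    String getCharset() { return charset; }

    // Simplified parser written for this sketch; it handles an optional
    // double-quoted value but not every form allowed by the MIME grammar.
    static String parseCharset(String header) {
        if (header == null) return null;
        String[] parts = header.split(";");
        for (int i = 1; i < parts.length; i++) {
            String p = parts[i].trim();
            int eq = p.indexOf('=');
            if (eq < 0) continue;
            String name = p.substring(0, eq).trim();
            if (!name.equalsIgnoreCase("charset")) continue;
            String value = p.substring(eq + 1).trim();
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            return value.isEmpty() ? null : value;
        }
        return null;
    }
}
```

With this sketch, constructing ContentSketch from the header value above and then parsing getContentType() again finds no charset, which is the behaviour described in this report, while getCharset() still returns the value captured at construction time.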



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

