nutch-dev mailing list archives

From "ASF GitHub Bot (Jira)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2773) SegmentReader (-dump or -get): show HTML content as UTF-8
Date Fri, 13 Mar 2020 09:09:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058535#comment-17058535 ]

ASF GitHub Bot commented on NUTCH-2773:
---------------------------------------

sebastian-nagel commented on pull request #501: NUTCH-2773 SegmentReader (-dump or -get):
show HTML content as UTF-8
URL: https://github.com/apache/nutch/pull/501
 
 
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> SegmentReader (-dump or -get): show HTML content as UTF-8
> ---------------------------------------------------------
>
>                 Key: NUTCH-2773
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2773
>             Project: Nutch
>          Issue Type: Improvement
>          Components: segment
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.17
>
>
> SegmentReader dumps and the output shown by -get are first converted to Java strings
and then written using UTF-8 as the output encoding. The HTML page content is held by the container
class "Content" as byte[]; if the original page encoding is a charset other than UTF-8,
the output of SegmentReader may look garbled. The reader could use the encoding already detected
by the parser (if available) and try to properly recode the HTML page content to UTF-8.
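
The recoding idea described above can be illustrated with a minimal sketch. This is not the
code from the pull request: the class name RecodeSketch and the methods decode and toUtf8 are
hypothetical, and the detected charset is assumed to arrive as a plain string (possibly null
or invalid) rather than through any particular Nutch API. Only standard JDK calls are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper, not part of Nutch: recodes raw page bytes to UTF-8
// using a charset name detected elsewhere (e.g. by a parser).
class RecodeSketch {

  // Decode the raw bytes with the detected charset; fall back to UTF-8
  // when no charset was detected or the name is not usable.
  public static String decode(byte[] content, String detectedCharset) {
    Charset charset = StandardCharsets.UTF_8;
    if (detectedCharset != null) {
      try {
        charset = Charset.forName(detectedCharset);
      } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
        // Assumption: an unknown or malformed name is treated like "not detected".
      }
    }
    return new String(content, charset);
  }

  // Recode the raw bytes to UTF-8 for output.
  public static byte[] toUtf8(byte[] content, String detectedCharset) {
    return decode(content, detectedCharset).getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // 0xE9 is "e with acute accent" in ISO-8859-1; in UTF-8 it takes two bytes.
    byte[] latin1 = new byte[] { (byte) 0xE9 };
    System.out.println(toUtf8(latin1, "ISO-8859-1").length);
  }
}
```

Falling back to UTF-8 mirrors the "if available" condition in the description: when no
encoding was detected, the output is the same as decoding the bytes directly as UTF-8.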



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
