nutch-dev mailing list archives

From "Nikos Mastropavlos (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-946) cache.jsp does not recognize encoding conversion from content different to UTF-8
Date Mon, 07 Mar 2011 07:11:59 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13003281#comment-13003281 ]

Nikos Mastropavlos commented on NUTCH-946:
------------------------------------------

Having tried this on some Greek websites encoded in Windows-1253, the correct meta name
seems to be "Content-Encoding" rather than "CharEncodingForConversion". Applying the patch
described above and adding
    if (encoding==null) encoding = (String) parseMetaData.get("Content-Encoding");
right after the CharEncodingForConversion lookup did the trick for me.
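
For illustration only, the fallback could be sketched outside the JSP as below. This is a minimal sketch, not the actual cached.jsp code: a plain Map stands in for Nutch's parse Metadata object, the class and method names (EncodingFallback, resolveEncoding, decode) are hypothetical, and the UTF-8 default for a missing or unsupported charset is an assumption of this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: a plain Map stands in for Nutch's parse Metadata.
class EncodingFallback {

    // Look up "CharEncodingForConversion" first, then fall back to "Content-Encoding".
    static String resolveEncoding(Map<String, String> parseMetaData) {
        String encoding = parseMetaData.get("CharEncodingForConversion");
        if (encoding == null) {
            encoding = parseMetaData.get("Content-Encoding");
        }
        return encoding;
    }

    // Decode the raw page bytes with the resolved charset.
    // Assumption of this sketch: use UTF-8 when no usable charset is found.
    static String decode(byte[] raw, Map<String, String> parseMetaData) {
        String encoding = resolveEncoding(parseMetaData);
        Charset charset = StandardCharsets.UTF_8;
        try {
            if (encoding != null && Charset.isSupported(encoding)) {
                charset = Charset.forName(encoding);
            }
        } catch (IllegalArgumentException e) {
            // Illegal charset name in the metadata: keep the UTF-8 default.
        }
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        Map<String, String> parseMetaData = new HashMap<>();
        parseMetaData.put("Content-Encoding", "windows-1253");
        // Only "Content-Encoding" is present, so the fallback supplies the charset name.
        System.out.println(resolveEncoding(parseMetaData));
    }
}
```

When both keys are present, "CharEncodingForConversion" still wins, which matches the order described above.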


> cache.jsp does not recognize encoding conversion from content different to UTF-8
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-946
>                 URL: https://issues.apache.org/jira/browse/NUTCH-946
>             Project: Nutch
>          Issue Type: Bug
>          Components: web gui
>    Affects Versions: 1.2
>         Environment: Server version: Apache Tomcat/6.0.29
> Server built:   July 19 2010 1458
> Server number:  6.0.0.29
> OS Name:        Linux
> OS Version:     2.6.18-128.7.1.el5
> Architecture:   i386
> JVM Version:    1.6.0_22-b04
> JVM Vendor:     Sun Microsystems Inc.
>            Reporter: Enrique Berlanga
>            Priority: Minor
>         Attachments: cache-946.patch
>
>
> Cache view does not recognize the encoding conversion needed to properly display page content stored in a segment.
> The problem is that it searches for the "CharEncodingForConversion" meta in the content metadata, but it is stored in the parse metadata.
> Here is the patch I've generated for the fixed version:
> ### Eclipse Workspace Patch 1.0
> #P branch-1.2
> Index: src/web/jsp/cached.jsp
> ===================================================================
> --- src/web/jsp/cached.jsp	(revision 1027060)
> +++ src/web/jsp/cached.jsp	(working copy)
> @@ -39,17 +39,18 @@
>      ResourceBundle.getBundle("org.nutch.jsp.cached", request.getLocale())
>      .getLocale().getLanguage();
>  
> -  Metadata metaData = bean.getParseData(details).getContentMeta();
> +  Metadata contentMetaData = bean.getParseData(details).getContentMeta();
> +  Metadata parseMetaData = bean.getParseData(details).getParseMeta();
>  
>    String content = null;
> -  String contentType = (String) metaData.get(Metadata.CONTENT_TYPE);
> +  String contentType = (String) contentMetaData.get(Metadata.CONTENT_TYPE);
>    if (contentType.startsWith("text/html")) {
>      // FIXME : it's better to emit the original 'byte' sequence 
>      // with 'charset' set to the value of 'CharEncoding',
>      // but I don't know how to emit 'byte sequence' in JSP.
>      // out.getOutputStream().write(bean.getContent(details)) may work, 
>      // but I'm not sure.
> -    String encoding = (String) metaData.get("CharEncodingForConversion"); 
> +    String encoding = (String) parseMetaData.get("CharEncodingForConversion"); 
>      if (encoding != null) {
>        try {
>          content = new String(bean.getContent(details), encoding);

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
