nutch-dev mailing list archives

From "Dominic Xu (JIRA)" <j...@apache.org>
Subject [jira] Created: (NUTCH-968) Crawling - File Error 404 when fetching file with an chinese word in the file name
Date Fri, 04 Mar 2011 01:20:36 GMT
Crawling - File Error 404 when fetching file with an chinese word in the file name 
-----------------------------------------------------------------------------------

                 Key: NUTCH-968
                 URL: https://issues.apache.org/jira/browse/NUTCH-968
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.2
         Environment: CentOS 5.4 with zh_CN.UTF8
            Reporter: Dominic Xu


I am crawling a local file system.
My problem is the following: files whose names contain Chinese characters do
not get crawled.
Example:
fetching  /mnt/中文.txt

I get the error: org.apache.nutch.protocol.file.FileError: File Error: 404.

I read issue NUTCH-824 (https://issues.apache.org/jira/browse/NUTCH-824)
and applied its patch from trunk (committed revision 1056394),
but the bug is still not fixed.

I fixed the problem by modifying the file src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/FileResponse.java:


     for (int i=0; i<list.length; i++) {
       f = list[i];
       String name = f.getName();
+      try {
+        // specify the encoding via the config later?
+        name = java.net.URLEncoder.encode(name, "UTF-8");
+      } catch (UnsupportedEncodingException ex) {
+      }
+
       String time = HttpDateFormat.toString(f.lastModified());

The file name must be encoded as UTF-8.
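To illustrate what this encoding step does to a name such as 中文.txt, here is a minimal standalone sketch. The class and method names (FileNameEncodingSketch, encodeName, decodeName) are hypothetical and not part of Nutch; only java.net.URLEncoder and java.net.URLDecoder from the JDK are used.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class, not part of Nutch.
public class FileNameEncodingSketch {

    // Percent-encode a single file name using UTF-8 bytes.
    static String encodeName(String name) {
        try {
            return URLEncoder.encode(name, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // Every JVM is required to support UTF-8, so this is not expected.
            throw new IllegalStateException(ex);
        }
    }

    // Reverse of encodeName, to show the round trip is lossless.
    static String decodeName(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes for the two Chinese characters, so the example
        // does not depend on the source file encoding.
        String name = "\u4e2d\u6587.txt";
        String encoded = encodeName(name);
        System.out.println(encoded);
        System.out.println(decodeName(encoded).equals(name));
    }
}
```

One caveat with this approach: URLEncoder produces application/x-www-form-urlencoded output, so a space in a file name becomes "+" rather than "%20". Whether that matters depends on how the link is decoded later, which I have not checked.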

I also added a meta tag declaring the charset to the generated content:
- StringBuffer x = new StringBuffer("<html><head>");
+ StringBuffer x = new StringBuffer("<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />");
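The two changes combine roughly as in the sketch below: a directory listing that declares UTF-8 in its head and uses percent-encoded names in its links. This is a simplified illustration under my own assumptions, not the actual FileResponse.java code; the class ListingSketch and the method listingHtml are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class, not the Nutch implementation.
public class ListingSketch {

    // Build a tiny HTML listing for the given file names.
    static String listingHtml(String[] names) {
        StringBuilder x = new StringBuilder(
            "<html><head><meta http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=utf-8\" />");
        x.append("</head><body>");
        for (String name : names) {
            String href;
            try {
                // Encode the link target as UTF-8 percent-escapes.
                href = URLEncoder.encode(name, "UTF-8");
            } catch (UnsupportedEncodingException ex) {
                // UTF-8 is always available on a JVM.
                throw new IllegalStateException(ex);
            }
            // The visible text is left as-is; a real listing should also
            // HTML-escape it.
            x.append("<a href=\"").append(href).append("\">")
             .append(name).append("</a><br/>");
        }
        x.append("</body></html>");
        return x.toString();
    }

    public static void main(String[] args) {
        System.out.println(listingHtml(new String[] {"\u4e2d\u6587.txt"}));
    }
}
```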



 

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
