nutch-dev mailing list archives

From "ysc (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (NUTCH-1736) Can't fetch page if http response header contains Transfer-Encoding:chunked
Date Mon, 17 Mar 2014 03:05:43 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13936103#comment-13936103 ]

ysc edited comment on NUTCH-1736 at 3/17/14 3:05 AM:
-----------------------------------------------------

problem:
 
fetching: http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html
Fetch failed with protocol status: EXCEPTION: java.io.IOException: unzipBestEffort returned null

detail:

2014-03-12 16:48:38,031 ERROR http.Http - Failed to get protocol output
java.io.IOException: unzipBestEffort returned null
at org.apache.nutch.protocol.http.api.HttpBase.processGzipEncoded(HttpBase.java:317)
at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:164)
at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:140)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
2014-03-12 16:48:38,031 INFO  fetcher.Fetcher - fetch of http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html failed with: java.io.IOException: unzipBestEffort returned null
2014-03-12 16:48:38,031 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0

solution:

This patch handles responses whose HTTP header contains Transfer-Encoding: chunked.
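
For readers unfamiliar with the format, here is a minimal sketch of how a chunked body is laid out and decoded: each chunk is a hexadecimal size line, that many bytes of data, and a CRLF, and a chunk of size 0 ends the body. This is an illustration only, not the attached patch and not Nutch's HttpResponse code; the class ChunkedBodySketch and its methods are hypothetical names, and chunk extensions and trailer headers are ignored.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Nutch: decodes a chunked HTTP body. */
final class ChunkedBodySketch {

  /** Reads one line as ASCII and returns it without the CR/LF ending. */
  static String readLine(InputStream in) throws IOException {
    StringBuilder sb = new StringBuilder();
    int c;
    while ((c = in.read()) != -1) {
      if (c == '\n') return sb.toString();
      if (c != '\r') sb.append((char) c);
    }
    if (sb.length() == 0) throw new EOFException("unexpected end of stream");
    return sb.toString();
  }

  /** Decodes chunks until the zero-size chunk; trailers are not read. */
  static byte[] decode(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while (true) {
      String sizeLine = readLine(in);
      int semi = sizeLine.indexOf(';');          // drop any chunk extension
      if (semi >= 0) sizeLine = sizeLine.substring(0, semi);
      int chunkLen = Integer.parseInt(sizeLine.trim(), 16);
      if (chunkLen == 0) break;                  // last chunk
      byte[] buf = new byte[chunkLen];
      int read = 0;
      while (read < chunkLen) {                  // read() may return fewer bytes
        int n = in.read(buf, read, chunkLen - read);
        if (n == -1) throw new EOFException("chunk truncated");
        read += n;
      }
      out.write(buf, 0, chunkLen);
      readLine(in);                              // consume the CRLF after the data
    }
    return out.toByteArray();
  }

  /** Convenience wrapper for experiments with ASCII test bodies. */
  static String decodeAscii(String raw) {
    try {
      byte[] body = decode(new ByteArrayInputStream(raw.getBytes(StandardCharsets.US_ASCII)));
      return new String(body, StandardCharsets.US_ASCII);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

For example, ChunkedBodySketch.decodeAscii("4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n") returns "Wikipedia".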

important tip:

The property http.content.limit in nutch-site.xml must be greater than 0.

Why must it be greater than 0?

If the property http.content.limit in nutch-site.xml is negative or 0, chunkLen becomes negative or 0 too. See the code below, at line 277 of the Java source file http://svn.apache.org/repos/asf/nutch/tags/release-1.7/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java

      if ( (contentBytesRead + chunkLen) > http.getMaxContent() )
        chunkLen= http.getMaxContent() - contentBytesRead;

Reading one chunk is guarded by this condition:

while (chunkBytesRead < chunkLen)

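So when the limit is 0 or negative, chunkLen is never positive, the loop body never runs, and no chunk data is read. The property http.content.limit in nutch-site.xml must therefore be greater than 0.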
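
The effect of the two quoted statements can be reproduced in isolation. The sketch below is a stand-alone model, not Nutch code: the class ContentLimitSketch and its methods are hypothetical, maxContent stands in for http.getMaxContent(), and the loop body only counts bytes instead of reading from a socket.

```java
/** Hypothetical stand-alone model of the two quoted statements; not Nutch code. */
final class ContentLimitSketch {

  /** Mirrors the quoted clamp: cap chunkLen so the total stays within maxContent. */
  static int clampChunkLen(int contentBytesRead, int chunkLen, int maxContent) {
    if ((contentBytesRead + chunkLen) > maxContent) {
      chunkLen = maxContent - contentBytesRead;
    }
    return chunkLen;
  }

  /** Counts the bytes that "while (chunkBytesRead < chunkLen)" would consume. */
  static int bytesLoopWouldRead(int chunkLen) {
    int chunkBytesRead = 0;
    while (chunkBytesRead < chunkLen) {
      chunkBytesRead++; // stand-in for reading one byte from the response
    }
    return chunkBytesRead;
  }

  public static void main(String[] args) {
    // Assumed sample values: a 4096-byte chunk arriving before anything was read.
    for (int limit : new int[] {-1, 0, 65536}) {
      int chunkLen = clampChunkLen(0, 4096, limit);
      System.out.println("limit=" + limit + " chunkLen=" + chunkLen
          + " bytesRead=" + bytesLoopWouldRead(chunkLen));
    }
  }
}
```

With a limit of -1 or 0 the clamped chunkLen is -1 or 0 and the loop reads nothing; with a limit of 65536 the full 4096 bytes are read.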


was (Author: yangshangchuan):
The same comment as above up to and including the important tip, without the explanation of why http.content.limit must be greater than 0.

> Can't fetch page if http response header contains Transfer-Encoding:chunked
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-1736
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1736
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 1.6, 2.1, 1.7, 2.2, 2.3, 1.8, 2.4, 1.9, 2.2.1
>            Reporter: ysc
>            Priority: Critical
>             Fix For: 2.3, 1.9
>
>         Attachments: nutch-2.2.1.patch, nutch1.7.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> fetching: http://szs.mof.gov.cn/zhengwuxinxi/zhengcefabu/201402/t20140224_1046354.html
> Fetch failed with protocol status: EXCEPTION: java.io.IOException: unzipBestEffort returned null



--
This message was sent by Atlassian JIRA
(v6.2#6252)
