nutch-dev mailing list archives

From "Ferdy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1039) Fetcher fails for pages without content-length header
Date Thu, 01 Sep 2011 08:30:09 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095182#comment-13095182 ]

Ferdy commented on NUTCH-1039:
------------------------------

This error is definitely caused by the server occasionally returning an *empty* Content-Length
header. I know this because we had the same problem with nu.nl before, which is actually why
I opened issue NUTCH-1096.

To summarize, there are 3 cases:

A) Server returns a valid Content-Length: the integer is parsed and all goes well.
B) Server returns no Content-Length: no integer is parsed; instead the content length is set to
Integer.MAX_VALUE (of course it is still limited by http.content.limit). Fetching continues
as normal.
C) Server returns an invalid Content-Length, whether it be an empty string or just plain garbage.
This results in a NumberFormatException followed by an HttpException.
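
The three cases can be sketched roughly as follows. This is an illustration only, not the actual
Nutch HttpResponse code: the class, the method, the exception type and the contentLimit parameter
(standing in for http.content.limit, assumed non-negative here) are all hypothetical names, and
the trim() call is an assumption about how whitespace would be handled.

```java
// Illustrative sketch only; not the actual Nutch implementation.
final class ContentLengthCases {

    /** Hypothetical stand-in for org.apache.nutch.protocol.http.api.HttpException. */
    static class BadContentLengthException extends Exception {
        BadContentLengthException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * @param headerValue  raw Content-Length value, or null if the header is absent
     * @param contentLimit stand-in for http.content.limit (assumed non-negative)
     * @return the maximum number of bytes to read from the response body
     */
    static int bytesToRead(String headerValue, int contentLimit)
            throws BadContentLengthException {
        // Case B: no header, so fall back to "as much as possible".
        int contentLength = Integer.MAX_VALUE;
        if (headerValue != null) {
            try {
                // Case A: a valid integer is parsed.
                contentLength = Integer.parseInt(headerValue.trim());
            } catch (NumberFormatException e) {
                // Case C: empty string or garbage. An empty value yields the
                // message "bad content length: " with nothing after the colon.
                throw new BadContentLengthException(
                        "bad content length: " + headerValue, e);
            }
        }
        // In every non-failing case the configured limit still applies.
        return Math.min(contentLength, contentLimit);
    }
}
```

With this sketch, bytesToRead("42", 65536) returns 42, bytesToRead(null, 65536) returns 65536,
and bytesToRead("", 65536) throws with the message "bad content length: ".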

Your case is C, because *org.apache.nutch.protocol.http.api.HttpException: bad content length:*
(with nothing after the colon) indicates an empty Content-Length.

To allow the cases with an empty string to proceed as normal, I created the patch in NUTCH-1096.
This issue is therefore somewhat of a duplicate of NUTCH-1096. However, I propose to close
this issue, as its title indicates that it is about case B. (But as mentioned before, case B
was never really a problem in the first place.)
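
For illustration, lenient handling of an empty value might look like the sketch below. The actual
NUTCH-1096 patch is not reproduced here; the class and method names are hypothetical, and treating
an empty or whitespace-only value the same as a missing header (case B) is an assumption about
what "proceed as normal" means.

```java
// Illustrative sketch only; not the NUTCH-1096 patch itself.
final class LenientContentLength {

    /**
     * @param headerValue raw Content-Length value, or null if the header is absent
     * @return the parsed length, or Integer.MAX_VALUE when the header is
     *         absent or empty (assumed to be handled like case B)
     * @throws NumberFormatException if the value is non-empty garbage
     */
    static int parse(String headerValue) {
        if (headerValue == null || headerValue.trim().isEmpty()) {
            // Absent or empty: no usable length, so read until the stream
            // ends or the configured content limit is reached.
            return Integer.MAX_VALUE;
        }
        // Non-empty garbage still fails here, as in case C.
        return Integer.parseInt(headerValue.trim());
    }
}
```

Under these assumptions, parse("") and parse(null) both return Integer.MAX_VALUE, while
parse("abc") still throws a NumberFormatException.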

> Fetcher fails for pages without content-length header
> -----------------------------------------------------
>
>                 Key: NUTCH-1039
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1039
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>
> Fetcher fails:
> 2011-07-11 14:45:34,764 ERROR http.Http - org.apache.nutch.protocol.http.api.HttpException: bad content length:
> 2011-07-11 14:45:34,765 ERROR http.Http - at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:218)
> 2011-07-11 14:45:34,765 ERROR http.Http - at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:158)
> 2011-07-11 14:45:34,765 ERROR http.Http - at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
> 2011-07-11 14:45:34,765 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:138)
> 2011-07-11 14:45:34,765 ERROR http.Http - at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:79)
> Both fetcher and indexing filter checker fail sometimes. I'm unsure whether this is something
> in Nutch or whether the remote server only returns content-length incidentally.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
