nutch-dev mailing list archives

From "Omkar Reddy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2557) protocol-http fails to follow redirections when an HTTP response body is invalid
Date Fri, 25 May 2018 11:32:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16490581#comment-16490581 ]

Omkar Reddy commented on NUTCH-2557:
------------------------------------

I agree. The HTTP body of an error response or a redirect sometimes contains diagnostic
information that could be helpful to the user, so storing it should be optional.


Could we name the property http.content.store.3XX.404, or is that too complicated a name for a property?
 

> protocol-http fails to follow redirections when an HTTP response body is invalid
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-2557
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2557
>             Project: Nutch
>          Issue Type: Sub-task
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> If a server sends a redirection (3XX status code, with a Location header), protocol-http
tries to parse the HTTP response body anyway. Thus, if an error occurs while decoding the
body, the redirection is not followed and the information is lost. Browsers follow the redirection
and close the socket as soon as they can.
>  * Example: this page is a redirection to its https version, with an HTTP body containing
invalid gzip-encoded content. Browsers follow the redirection, but Nutch throws an error:
>  ** [http://www.webarcelona.net/es/blog?page=2]
>  
> The HttpResponse::getContent method can already return null. I think it should at least
return null when parsing the HTTP response body fails.
> Ideally, we would adopt the same behavior as browsers, and not even try parsing the body
when the headers indicate a redirection.
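
The two behaviors proposed in the description can be sketched roughly as follows. This is only an illustration, not the protocol-http code: the class RedirectAwareBody and its methods are hypothetical names, and the body is assumed to arrive as raw gzip bytes. A 3XX status with a Location header skips decoding entirely, and a decoding failure on any other response yields null instead of an exception.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration; not part of Nutch.
final class RedirectAwareBody {

    /** True for any 3XX status code. */
    static boolean isRedirect(int status) {
        return status >= 300 && status < 400;
    }

    /** Gunzips raw bytes; returns null instead of throwing when the data is invalid. */
    static byte[] gunzipOrNull(byte[] raw) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(raw));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            // Invalid or truncated gzip data: report "no content" rather than failing.
            return null;
        }
    }

    /**
     * Returns the decoded body, or null when the response is a redirect
     * (3XX with a Location header) or when the body cannot be decoded.
     */
    static byte[] contentFor(int status, String location, byte[] rawGzipBody) {
        if (isRedirect(status) && location != null) {
            return null; // browser-like behavior: do not even try to parse the body
        }
        return gunzipOrNull(rawGzipBody);
    }

    /** Small helper to produce valid gzip bytes for trying the sketch out. */
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

With this sketch, a caller would still read the Location header and follow the redirect even when contentFor returns null, which is the situation reported for the example URL above.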



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
