nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2549) protocol-http does not behave the same as browsers
Date Mon, 11 Jun 2018 13:05:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16507998#comment-16507998
] 

ASF GitHub Bot commented on NUTCH-2549:
---------------------------------------

sebastian-nagel opened a new pull request #347: NUTCH-2549  protocol-http does not behave
the same as browsers
URL: https://github.com/apache/nutch/pull/347
 
 
   - integrates patch provided by Gerard Bouchar
   - fixes sub-tasks (see commit messages)
   - adds unit tests to verify that issues are solved
   
   Note: to avoid future merge conflicts this branch/PR includes code refactorings made for
NUTCH-2576.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> protocol-http does not behave the same as browsers
> --------------------------------------------------
>
>                 Key: NUTCH-2549
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2549
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.14
>            Reporter: Gerard Bouchar
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.15
>
>         Attachments: NUTCH-2549.patch
>
>
> We identified the following issues in protocol-http (a plugin implementing the HTTP protocol):
>  * It fails if a URL's path does not start with '/'.
>  ** Example: [http://news.fx678.com?171|http://news.fx678.com/?171] (browsers correctly
rewrite the URL as [http://news.fx678.com/?171], while Nutch tries to send an invalid HTTP
request starting with *GET ?171 HTTP/1.0*).
>  * It advertises its requests as being HTTP/1.0, but sends an _Accept-Encoding_ request
header, which is defined only in HTTP/1.1. This confuses some web servers.
>  ** Example: [http://www.hansamanuals.com/main/english/none/theconf___987/manuals/version___82/hwconvindex.htm]
>  * If a server sends a redirection (3XX status code, with a Location header), protocol-http
tries to parse the HTTP response body anyway. Thus, if an error occurs while decoding the
body, the redirection is not followed and the information is lost. Browsers follow the redirection
and close the socket as soon as they can.
>  ** Example: [http://www.webarcelona.net/es/blog?page=2]
>  * Some servers invalidly send an HTTP body directly without a status line or headers.
Browsers handle that; protocol-http doesn't.
>  ** Example: [https://app.unitymedia.de/]
>  * Some servers invalidly add colons after the HTTP status code in the status line (they
can send _HTTP/1.1 404: Not found_ instead of _HTTP/1.1 404 Not found_ for instance). Browsers
can handle that.
>  * Some servers invalidly send headers that span multiple lines. In that case, browsers
simply ignore the subsequent lines, but protocol-http throws an error, thus preventing us
from fetching the contents of the page.
>  * There is no limit on the size of the HTTP headers it reads. A bogus server could
send an infinite stream of different HTTP headers and cause the fetcher to run out of memory,
or send the same HTTP header repeatedly and cause the fetcher to time out.
>  * The same goes for the HTTP status line: no check is made concerning its size.
>  * While reading chunked content, if the content size becomes larger than http.getMaxContent(),
instead of just stopping, it tries to read a new chunk before having read the previous one
completely, resulting in a 'bad chunk length' error.
> Additionally (and this concerns protocol-httpclient as well), when reading
HTTP headers, the SpellCheckedMetadata class computes, for each header, a Levenshtein distance
between it and every known header in the HttpHeaders interface. Not only is that slow, non-standard,
and inconsistent with browser behavior, but it also causes bugs and prevents us from accessing
the real headers sent by the HTTP server.
>  * Example: [http://www.taz.de/!443358/]. The server sends a *Client-Transfer-Encoding:
chunked* header, but SpellCheckedMetadata corrects it to *Transfer-Encoding: chunked*. Then,
HttpResponse (in protocol-http) tries to read the HTTP body as chunked, whereas it is not.
>  
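
Two of the items in the issue description above (a URL whose path does not start
with '/', and a status line with a colon after the status code) can be illustrated
with a minimal sketch. This is not the code from the attached patch or from PR #347;
the class and method names (LenientHttp, requestTarget, parseStatusCode) are
hypothetical and only show one possible browser-like handling.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch: a sketch of browser-like leniency.
final class LenientHttp {

  /** Builds the request target for the request line; an empty path becomes "/". */
  static String requestTarget(URI uri) {
    String path = uri.getRawPath();
    if (path == null || path.isEmpty()) {
      path = "/"; // avoids sending "GET ?171 HTTP/1.0"
    }
    String query = uri.getRawQuery();
    return query == null ? path : path + "?" + query;
  }

  /** Extracts the status code, tolerating a trailing colon such as "404:". */
  static int parseStatusCode(String statusLine) {
    String[] parts = statusLine.trim().split("\\s+", 3);
    if (parts.length < 2 || !parts[0].startsWith("HTTP/")) {
      throw new IllegalArgumentException("Not a status line: " + statusLine);
    }
    String code = parts[1];
    if (code.endsWith(":")) {
      code = code.substring(0, code.length() - 1);
    }
    return Integer.parseInt(code);
  }

  public static void main(String[] args) {
    // prints "/?171"
    System.out.println(requestTarget(URI.create("http://news.fx678.com?171")));
    // prints "404"
    System.out.println(parseStatusCode("HTTP/1.1 404: Not found"));
  }
}
```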

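The points about unbounded status lines and headers, and about headers spanning
multiple lines, could be handled along the following lines. This is only a sketch
under stated assumptions: the class name BoundedHeaderReader and both limits are
hypothetical (the issue does not name any values), and continuation lines are
simply skipped, as the description says browsers do.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch.
final class BoundedHeaderReader {

  // Assumed limits, chosen for illustration only.
  static final int MAX_LINE_BYTES = 8 * 1024;
  static final int MAX_HEADER_LINES = 100;

  /** Reads one line (without CR/LF), failing once it exceeds maxBytes. */
  static String readLine(InputStream in, int maxBytes) throws IOException {
    StringBuilder sb = new StringBuilder();
    int c;
    while ((c = in.read()) != -1 && c != '\n') {
      if (sb.length() >= maxBytes) {
        throw new IOException("Line exceeds " + maxBytes + " bytes");
      }
      sb.append((char) c); // bytes are mapped 1:1, i.e. ISO-8859-1
    }
    int len = sb.length();
    if (len > 0 && sb.charAt(len - 1) == '\r') {
      sb.setLength(len - 1);
    }
    return sb.toString();
  }

  /** Reads header lines up to the empty line, skipping continuation lines. */
  static List<String> readHeaders(InputStream in, int maxLineBytes, int maxLines)
      throws IOException {
    List<String> headers = new ArrayList<>();
    int linesRead = 0;
    while (true) {
      String line = readLine(in, maxLineBytes);
      if (line.isEmpty()) {
        return headers; // end of header section (or end of stream)
      }
      if (++linesRead > maxLines) {
        throw new IOException("More than " + maxLines + " header lines");
      }
      char first = line.charAt(0);
      if (first == ' ' || first == '\t') {
        continue; // continuation of the previous header: ignored
      }
      headers.add(line);
    }
  }
}
```

Counting every line read, including skipped continuation lines, keeps a server
from bypassing the limit with an endless folded header.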

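For the SpellCheckedMetadata problem described above, one alternative is an exact,
case-insensitive lookup of header names with no fuzzy matching. The sketch below is
an assumption about how that could look, not the fix actually applied in Nutch;
ExactHeaders is a hypothetical class, and for brevity a repeated header simply
overwrites the earlier value.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Nutch.
final class ExactHeaders {

  // Header names compare case-insensitively, but are never "corrected".
  private final Map<String, String> headers =
      new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

  /** Stores one "Name: value" line; lines without a name are ignored. */
  void add(String line) {
    int colon = line.indexOf(':');
    if (colon <= 0) {
      return;
    }
    headers.put(line.substring(0, colon).trim(),
        line.substring(colon + 1).trim());
  }

  String get(String name) {
    return headers.get(name);
  }

  /** True only if a header named exactly Transfer-Encoding mentions chunked. */
  boolean isChunked() {
    String value = get("Transfer-Encoding");
    return value != null && value.toLowerCase(Locale.ROOT).contains("chunked");
  }
}
```

With this lookup, a response carrying only "Client-Transfer-Encoding: chunked" is
not treated as chunked, while "transfer-encoding: chunked" still is.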

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
