nutch-dev mailing list archives

From "Sebastian Nagel (Jira)" <j...@apache.org>
Subject [jira] [Created] (NUTCH-2760) protocol-okhttp: properly record HTTP version in request message header
Date Fri, 13 Dec 2019 11:36:00 GMT
Sebastian Nagel created NUTCH-2760:
--------------------------------------

             Summary: protocol-okhttp: properly record HTTP version in request message header
                 Key: NUTCH-2760
                 URL: https://issues.apache.org/jira/browse/NUTCH-2760
             Project: Nutch
          Issue Type: Bug
          Components: plugin, protocol
    Affects Versions: 1.16
            Reporter: Sebastian Nagel
             Fix For: 1.17


The HTTP version recorded in the request message by the plugin protocol-okhttp ({{store.http.request=true}})
is not the version sent in the request but the one received in the response.

Note that the HTTP version sent in the request may differ from that sent back in the response.
One example (tracked using wget):

{noformat}
> wget -d https://www.kp.ru/daily/27061/4129507/
...
---request begin---
GET /daily/27061/4129507/ HTTP/1.1
User-Agent: Wget/1.20.3 (linux-gnu)
Accept: */*
Accept-Encoding: identity
Host: www.kp.ru
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response... 
---response begin---
HTTP/1.0 200 OK
...
{noformat}

protocol-okhttp uses the response version ("HTTP/1.0") also for the request:

{noformat}
> bin/nutch parsechecker -Dstore.http.headers=true -Dstore.http.request=true \
     -Dplugin.includes='protocol-okhttp|parse-html' https://www.kp.ru/daily/27061/4129507/
...
_request_=GET /daily/27061/4129507/ HTTP/1.0
...
_response.headers_=HTTP/1.0 200 OK
...
{noformat}


The plugin protocol-http tracks the versions correctly:

{noformat}
> bin/nutch parsechecker -Dstore.http.headers=true -Dstore.http.request=true \
     -Dplugin.includes='protocol-http|parse-html' https://www.kp.ru/daily/27061/4129507/
...
_request_=GET /daily/27061/4129507/ HTTP/1.1
...
_response.headers_=HTTP/1.0 200 OK
...
{noformat}
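
The two parsechecker outputs above can be compared mechanically. The following is a minimal sketch, not part of Nutch: it assumes the values of {{_request_}} and {{_response.headers_}} have already been extracted as strings, and the helper names ({{first_line}}, {{request_version}}, {{response_version}}, {{recorded_versions}}) are hypothetical. Equal versions alone do not prove the bug, since a server may well answer with the version it was sent. A request recorded as HTTP/1.0 by a client known to send HTTP/1.1 is the symptom described here.

```python
def first_line(text):
    """Return the first line of a header block, or '' if the block is empty."""
    lines = text.splitlines()
    return lines[0] if lines else ""


def request_version(request):
    """HTTP version of a request line such as 'GET /path HTTP/1.1' (last token)."""
    parts = first_line(request).split()
    if parts and parts[-1].startswith("HTTP/"):
        return parts[-1]
    return None


def response_version(response_headers):
    """HTTP version of a status line such as 'HTTP/1.0 200 OK' (first token)."""
    parts = first_line(response_headers).split()
    if parts and parts[0].startswith("HTTP/"):
        return parts[0]
    return None


def recorded_versions(request, response_headers):
    """Return (request version, response version) as recorded in the metadata."""
    return request_version(request), response_version(response_headers)


# Values copied from the two parsechecker runs in this report.
okhttp = recorded_versions("GET /daily/27061/4129507/ HTTP/1.0", "HTTP/1.0 200 OK")
http = recorded_versions("GET /daily/27061/4129507/ HTTP/1.1", "HTTP/1.0 200 OK")

print("protocol-okhttp:", okhttp)
print("protocol-http:  ", http)
```

For the values above, the protocol-okhttp pair shows the same version twice, while the protocol-http pair keeps the request and response versions distinct, matching what wget reports.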




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
