nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-2707) protocol-okhttp fails to decompress content if Content-Encoding header is wrong
Date Sun, 07 Apr 2019 14:26:00 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2707:
-----------------------------------
    Summary: protocol-okhttp fails to decompress content if Content-Encoding header is wrong
 (was: protocol-okhttp fails to decompress gzip-encoded content)

> protocol-okhttp fails to decompress content if Content-Encoding header is wrong
> -------------------------------------------------------------------------------
>
>                 Key: NUTCH-2707
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2707
>             Project: Nutch
>          Issue Type: Bug
>          Components: plugin, protocol
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.16
>
>
> The plugin protocol-okhttp does not decompress the returned gzipped content for some
> rare pages. This appears to happen because the HTTP response header does not specify
> {{Content-Encoding: gzip}} but {{Content-Encoding: zlib,gzip,deflate}}.
> {noformat}
> % bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' \
>       -Dstore.http.headers=true -Dstore.http.request=true \
>       http://24310.gr/afroditi-42426.html
> fetching: http://24310.gr/afroditi-42426.html 
> ...
> contentType: application/gzip
> ...
> Content Metadata: Transfer-Encoding=chunked ... Content-Encoding=zlib,gzip,deflate ... _request_=GET /afroditi-42426.html HTTP/1.1
> ...
> Accept-Encoding: gzip
>  _response.headers_=HTTP/1.1 200 OK
> ...
> Content-Encoding: zlib,gzip,deflate
> ...
> Transfer-Encoding: chunked
> Connection: keep-alive
> {noformat}
> The plugin protocol-http requests {{Accept-Encoding: x-gzip, gzip, deflate}} and gets
> the correct response header:
> {noformat}
> % bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-tika' \
>        -Dstore.http.headers=true -Dstore.http.request=true http://24310.gr/afroditi-42426.html
> ...
> contentType: application/xhtml+xml
> ...
> Content Metadata: ... Content-Encoding=gzip ... _request_=GET /afroditi-42426.html HTTP/1.1
> Host: 24310.gr
> Accept-Encoding: x-gzip, gzip, deflate
> ...
> {noformat}
> The same holds for Firefox, which sends {{Accept-Encoding: gzip, deflate}}.
> I will report the issue upstream to okhttp. It would also be possible to handle the
> content encoding in the protocol implementation: if the Accept-Encoding header is set
> explicitly, the okhttp library does not decompress the content and expects the calling
> code to handle it.
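
The option of handling the content encoding in the calling code could look roughly like the sketch below. It is not the fix applied in Nutch and not okhttp API: the class ContentDecoder and its methods are hypothetical names, and it uses only the JDK. It assumes the caller already has the raw response body and the Content-Encoding header value. It decompresses only when the header lists gzip among its comma-separated tokens (as in {{zlib,gzip,deflate}}) and the body starts with the gzip magic bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper for illustration; not part of Nutch or okhttp. */
final class ContentDecoder {

  /** True if the header value lists gzip (or x-gzip) among its comma-separated tokens. */
  static boolean listsGzip(String contentEncoding) {
    if (contentEncoding == null) {
      return false;
    }
    for (String token : contentEncoding.toLowerCase(Locale.ROOT).split(",")) {
      String t = token.trim();
      if (t.equals("gzip") || t.equals("x-gzip")) {
        return true;
      }
    }
    return false;
  }

  /** True if the body starts with the gzip magic bytes 0x1f 0x8b. */
  static boolean looksGzipped(byte[] body) {
    return body != null && body.length >= 2
        && (body[0] & 0xff) == 0x1f && (body[1] & 0xff) == 0x8b;
  }

  /** Returns the gunzipped body if header and magic bytes agree, else the body unchanged. */
  static byte[] decode(byte[] body, String contentEncoding) {
    if (!listsGzip(contentEncoding) || !looksGzipped(body)) {
      return body;
    }
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(body));
         ByteArrayOutputStream out = new ByteArrayOutputStream()) {
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Test helper: gzip-compresses the given bytes. */
  static byte[] gzip(byte[] data) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(data);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    byte[] page = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
    // Header value taken from the response shown in the issue description.
    byte[] decoded = decode(gzip(page), "zlib,gzip,deflate");
    System.out.println(new String(decoded, StandardCharsets.UTF_8));
  }
}
```

Checking the magic bytes in addition to the header is a conservative choice for this sketch: a body that is not actually gzipped is passed through unchanged rather than failing. Deflate and chained encodings are deliberately left out.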



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
