nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (NUTCH-2716) protocol-http: Response headers are not stored for a compressed response
Date Fri, 24 May 2019 13:28:00 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-2716.
------------------------------------
    Resolution: Fixed

Merged [PR #454|https://github.com/apache/nutch/pull/454]. Thanks, [~yossi]!

??Even when store.http.headers=true, the HTTP headers are not saved for a gzipped or deflated
response, because they may contain an incorrect content-length header. This causes WARCExporter
to generate "resource" (header-less) entries instead of "response" entries. The correct behavior
is to store all the headers, and code that uses them should be aware and careful that they
represent the original headers, not the stored content.??

??This fixes protocol-http, protocol-selenium, and protocol-htmlunit to write the raw response
headers, and adds logic to WARCExporter and CommonCrawlDataDumper to fix these headers.??

??It also fixes NUTCH-2715 (WARCExporter fails on large records), and upgrades lib-htmlunit
to use version 3.141.5 of Selenium, since Eclipse fails to compile otherwise (conflicts with
lib-selenium).??
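
For illustration only, here is a minimal Java sketch of how an exporter might reconcile the raw response headers with the stored, already-decompressed content: it drops Content-Encoding and rewrites Content-Length to the stored byte count. This is not the code merged in PR #454. The class name HeaderFixer and the method fixHeaders are hypothetical, and the sketch assumes CRLF-separated headers without folded continuation lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: adjusts raw HTTP response headers
// so that they describe the stored (decoded) content.
final class HeaderFixer {

  // rawHeaders: status line plus header lines, separated by CRLF.
  // storedLength: number of bytes of the content as stored after decoding.
  static String fixHeaders(String rawHeaders, int storedLength) {
    List<String> out = new ArrayList<>();
    boolean sawLength = false;
    for (String line : rawHeaders.split("\r\n")) {
      if (line.isEmpty()) {
        continue; // skip the blank line that terminates the header block
      }
      String lower = line.toLowerCase(Locale.ROOT);
      if (lower.startsWith("content-encoding:")) {
        continue; // stored content is no longer gzipped or deflated
      }
      if (lower.startsWith("content-length:")) {
        // replace the on-the-wire length with the stored length
        out.add("Content-Length: " + storedLength);
        sawLength = true;
        continue;
      }
      out.add(line); // status line and all other headers pass through
    }
    if (!sawLength) {
      out.add("Content-Length: " + storedLength);
    }
    // re-terminate the header block with an empty line
    return String.join("\r\n", out) + "\r\n\r\n";
  }

  public static void main(String[] args) {
    String raw = "HTTP/1.1 200 OK\r\n"
        + "Content-Type: text/html\r\n"
        + "Content-Encoding: gzip\r\n"
        + "Content-Length: 5\r\n\r\n";
    System.out.print(fixHeaders(raw, 11));
  }
}
```

The point of the sketch is that the raw headers stay intact in storage and are only adjusted at export time, so other consumers still see what the server originally sent.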

> protocol-http: Response headers are not stored for a compressed response
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-2716
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2716
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 1.15
>            Reporter: Yossi Tamari
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.16
>
>
> Even when store.http.headers=true, the HTTP headers are not saved for a gzipped or deflated
> response, because they may contain an incorrect content-length header.
> This causes WARCExporter to generate "resource" (headerless) entries instead of "response"
> entries.
> While I can see why reporting the wrong content-encoding and length may be a bug, removing
> all the headers is not a fix.
> I am not submitting a patch yet since I'm not sure what the best fix is, but I guess
> the best patch is to remove those two header lines and store the rest of the headers. If there
> is no objection, I can submit a patch that does this. Otherwise, what would be a better fix?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
