nutch-dev mailing list archives

From "Joris Rau (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-2213) CommonCrawlDataDumper saves gzipped body in extracted form
Date Wed, 10 Feb 2016 10:32:18 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Rau updated NUTCH-2213:
-----------------------------
    Description: 
I have downloaded [a WARC file|https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-40/segments/1443738099622.98/warc/CC-MAIN-20151001222139-00240-ip-10-137-6-227.ec2.internal.warc.gz]
from the Common Crawl data. This file contains several gzipped responses which are stored as
plaintext (without the gzip encoding).
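
To make the mismatch concrete, here is a small self-contained sketch (made-up payload, not Nutch or
warctools code) showing how the byte count on disk drifts away from the captured Content-Length once
a gzipped body is written in extracted form:

{code:java}
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class ContentLengthMismatch {
    public static void main(String[] args) throws Exception {
        // Hypothetical payload; any body served with Content-Encoding: gzip shows the same effect.
        byte[] plain = "<html><body>example response body, repeated enough to compress well well well</body></html>"
                .getBytes(StandardCharsets.UTF_8);

        // Compress it the way the origin server did before the crawler captured the record.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(plain);
        }
        byte[] gzipped = buf.toByteArray();

        // The HTTP headers stored in the WARC record carry the gzipped length. If the dumper
        // writes the extracted body instead, the bytes on disk no longer match that length.
        System.out.println("Content-Length in the captured headers (gzipped): " + gzipped.length);
        System.out.println("Bytes actually written when stored extracted:     " + plain.length);
        System.out.println("Offset error for a length-based reader:           " + (plain.length - gzipped.length));
    }
}
{code}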

I used [warctools|https://github.com/internetarchive/warctools] from the Internet Archive to extract
the responses from the WARC file. However, this tool expects the Content-Length field to
match the actual length of the body in the WARC ([see the issue on GitHub|https://github.com/internetarchive/warctools/pull/14#issuecomment-182048962]).
This warctools is a more up-to-date version of the Hanzo warctools that is recommended on the [Common
Crawl website|https://commoncrawl.org/the-data/get-started/] under "Processing the file format".
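
For reference, the kind of length-based skipping such readers rely on looks roughly like this (only a
sketch assuming an uncompressed WARC with ASCII headers; it is not the actual warctools code, just an
illustration of why a stale Content-Length corrupts the position of every following record):

{code:java}
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class WarcSkipSketch {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            int record = 0;
            while (true) {
                long contentLength = -1;
                String line = readLine(in);
                if (line == null) break;                  // end of file
                while (line != null && !line.isEmpty()) { // read one record header
                    if (line.toLowerCase().startsWith("content-length:")) {
                        contentLength = Long.parseLong(line.substring("content-length:".length()).trim());
                    }
                    line = readLine(in);
                }
                if (contentLength < 0) break;             // malformed header, give up
                System.out.println("record " + (++record) + ": block of " + contentLength + " bytes");
                // Skip the record block plus the two empty lines that terminate a record.
                // If the block was rewritten in extracted form without updating Content-Length,
                // this skip lands in the wrong place and no further record is found.
                long toSkip = contentLength + 4;
                while (toSkip > 0) {
                    long skipped = in.skip(toSkip);
                    if (skipped <= 0) return;
                    toSkip -= skipped;
                }
            }
        }
    }

    // Read one header line terminated by CRLF (or LF); null only at end of stream.
    private static String readLine(DataInputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            if (b != '\r') sb.append((char) b);
        }
        return (b == -1 && sb.length() == 0) ? null : sb.toString();
    }
}
{code}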

I have not been using Nutch myself, so I cannot say which versions are affected by this.

After reading [the official WARC draft|http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html]
I could not find out how gzipped content is supposed to be stored. However, multiple
WARC file parsers will probably have an issue with this.

It would be nice to know whether you consider this a bug and plan on fixing it, and whether
this is a major issue that affects most WARC files of the Common Crawl data or only a small
part of them.

> CommonCrawlDataDumper saves gzipped body in extracted form
> ----------------------------------------------------------
>
>                 Key: NUTCH-2213
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2213
>             Project: Nutch
>          Issue Type: Bug
>          Components: commoncrawl, dumpers
>            Reporter: Joris Rau
>            Priority: Critical
>              Labels: easyfix
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
