nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2715) WARCExporter fails on large records
Date Mon, 06 May 2019 14:12:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833858#comment-16833858 ]

Sebastian Nagel commented on NUTCH-2715:
----------------------------------------

Hi [~yossi], thanks! Unfortunately, neither of the WARC writer tools shipped with Nutch 1.x is ideal:
 - commoncrawldump: runs only locally (not in distributed mode) and has a license issue with a transitive dependency (see NUTCH-2622)
 - warc (org.apache.nutch.tools.warc.WARCExporter): based on a library ([warc-hadoop|https://github.com/ept/warc-hadoop]) which is no longer maintained and does not write warcinfo and request records

Both tools:
 - cannot write WARC files with per-record compression, which is recommended and the de-facto standard because some WARC readers expect every record to be its own gzip stream (see the sketch after this list)
 - delegate any fixes to the HTTP headers to the protocol plugins, which isn't a good solution. Nutch stores the content de-chunked and uncompressed, so wrong headers (namely {{Content-Encoding}}, {{Content-Length}} and {{Transfer-Encoding}}) may also cause WARC readers to fail
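
For illustration only, here is a minimal sketch of what per-record compression means (the class and method names are made up, this is not Nutch code): each serialized WARC record is compressed as its own gzip member, and the members are simply concatenated, so a reader can decompress every record independently:

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

/** Sketch only: append each serialized WARC record as its own gzip member,
 *  so the resulting .warc.gz file is a concatenation of independently
 *  compressed records. */
public class PerRecordGzipWriter {

  private final OutputStream out;

  public PerRecordGzipWriter(OutputStream out) {
    this.out = out;
  }

  /** Compress one serialized WARC record and append it to the output. */
  public void writeRecord(byte[] serializedRecord) throws IOException {
    ByteArrayOutputStream member = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(member)) {
      gz.write(serializedRecord);
    } // closing the GZIPOutputStream finishes this gzip member
    out.write(member.toByteArray());
    out.flush();
  }
}
{code}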

I'm sorry about this, but I would opt not to use WARCExporter in its current state at all.
Maybe you can use the [WARC writer in Common Crawl's fork of Nutch|https://github.com/apache/nutch/compare/master...commoncrawl:cc-warc-writer].
I have long planned to push it upstream (with the CC-specific features removed) but never found
the time.

> WARCExporter fails on large records
> -----------------------------------
>
>                 Key: NUTCH-2715
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2715
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.15
>            Reporter: Yossi Tamari
>            Priority: Major
>
> com.martinkl.warc.WARCRecord throws an IllegalStateException when a single line is over
> 10,000 bytes. Since this exception is not caught in WARCExporter, it fails the whole export.
> I doubt the validity of this limitation in WARCRecord, but regardless, I think WARCExporter
> should catch the exception and skip to the next record.
> (See also [https://github.com/ept/warc-hadoop/issues/5])
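
To illustrate the catch-and-skip behavior proposed in the issue description above, a minimal sketch (with made-up names, not the actual WARCExporter code) could look like this:

{code:java}
import java.util.List;

/** Sketch only: count and skip a record that the WARC library rejects
 *  (e.g. with an IllegalStateException on an over-long header line)
 *  instead of failing the whole export. */
public class SkipBadRecords {

  /** Placeholder for whatever actually serializes and writes a record. */
  interface RecordWriter {
    void write(byte[] record); // may throw IllegalStateException
  }

  static long exportAll(List<byte[]> records, RecordWriter writer) {
    long skipped = 0;
    for (byte[] record : records) {
      try {
        writer.write(record);
      } catch (IllegalStateException e) {
        skipped++; // skip this record and continue with the next one
      }
    }
    return skipped; // ideally reported as a job counter
  }
}
{code}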



