nutch-dev mailing list archives

From "Yossi Tamari (JIRA)" <>
Subject [jira] [Commented] (NUTCH-2715) WARCExporter fails on large records
Date Tue, 07 May 2019 13:24:00 GMT


Yossi Tamari commented on NUTCH-2715:

It seems to me like the commoncrawldump plugin is literally useless - not only is it not
distributed, it is not even multi-threaded, and it takes many hours to process a single segment
(45 minutes of fetching). It also creates a file per URL, which is pretty disastrous for the
file system.

> WARCExporter fails on large records
> -----------------------------------
>                 Key: NUTCH-2715
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.15
>            Reporter: Yossi Tamari
>            Priority: Major
> com.martinkl.warc.WARCRecord throws an IllegalStateException when a single line is over
> 10,000 bytes. Since this exception is not caught in WARCExporter, it fails the whole export.
> I doubt the validity of the limitation in WARCRecord, but regardless, I think WARCExporter
> should catch the exception and skip to the next record.
> (See also [])
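
To illustrate the "catch the exception and skip to the next record" behaviour suggested in the description, here is a minimal, self-contained sketch. It does not use the real WARCExporter or com.martinkl.warc.WARCRecord code: the class SkipBadRecords and the methods parseRecord and exportAll are hypothetical stand-ins, and the only detail taken from the issue is that an IllegalStateException is thrown for a line over 10,000 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SkipBadRecords {
    // Assumed limit, taken from the 10,000-byte figure in the issue description.
    static final int MAX_LINE_BYTES = 10_000;

    // Hypothetical stand-in for the record parser: rejects any over-long line,
    // mimicking the failure described in the issue.
    static String parseRecord(String raw) {
        for (String line : raw.split("\n")) {
            if (line.getBytes(StandardCharsets.UTF_8).length > MAX_LINE_BYTES) {
                throw new IllegalStateException("line exceeds " + MAX_LINE_BYTES + " bytes");
            }
        }
        return raw;
    }

    // Exports every record it can; a failing record is logged and skipped
    // instead of aborting the whole export.
    static List<String> exportAll(List<String> rawRecords) {
        List<String> exported = new ArrayList<>();
        int skipped = 0;
        for (String raw : rawRecords) {
            try {
                exported.add(parseRecord(raw));
            } catch (IllegalStateException e) {
                skipped++;
                System.err.println("Skipping record: " + e.getMessage());
            }
        }
        System.err.println("Skipped " + skipped + " record(s)");
        return exported;
    }

    public static void main(String[] args) {
        String huge = "x".repeat(MAX_LINE_BYTES + 1); // one line just over the limit
        List<String> out = exportAll(List.of("short record", huge, "another\nrecord"));
        System.out.println("Exported " + out.size() + " of 3 records");
    }
}
```

The try/catch sits inside the loop, so one bad record costs only that record. Counting the skipped records keeps the data loss visible in the logs rather than silent.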

