nutch-dev mailing list archives
From "Yossi Tamari (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2715) WARCExporter fails on large records
Date Tue, 07 May 2019 13:24:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834751#comment-16834751 ]

Yossi Tamari commented on NUTCH-2715:
-------------------------------------

It seems to me that the commoncrawldump plugin is practically useless: not only is it not
distributed, it is not even multi-threaded, and it takes many hours to process a single segment
(one that took only 45 minutes to fetch). It also creates a file per URL, which is pretty
disastrous for the file system.

> WARCExporter fails on large records
> -----------------------------------
>
>                 Key: NUTCH-2715
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2715
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.15
>            Reporter: Yossi Tamari
>            Priority: Major
>
> com.martinkl.warc.WARCRecord throws an IllegalStateException when a single line is over 10,000 bytes. Since this exception is not caught in WARCExporter, it fails the whole export.
> I doubt the validity of that limitation in WARCRecord, but regardless, I think WARCExporter should catch the exception and skip to the next record (see the sketch below).
> (See also [https://github.com/ept/warc-hadoop/issues/5])
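
A minimal sketch of the suggested handling, assuming WARCExporter rebuilds each record through
warc-hadoop's WARCRecord(DataInput) constructor; the class name, helper name, and logging below
are illustrative, not the actual Nutch code:

    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;
    import java.io.IOException;

    import com.martinkl.warc.WARCRecord;

    public class SkipOversizedRecords {

        /**
         * Attempt to parse serialized WARC bytes back into a WARCRecord.
         * Returns null so the caller can skip the record when warc-hadoop
         * rejects it, e.g. the IllegalStateException thrown when a single
         * line exceeds its internal line-length limit (reportedly 10,000 bytes).
         */
        public static WARCRecord parseOrSkip(byte[] warcBytes) throws IOException {
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(warcBytes))) {
                return new WARCRecord(in);
            } catch (IllegalStateException e) {
                // Oversized or malformed record: log and skip instead of
                // letting the exception fail the whole export job.
                System.err.println("Skipping malformed WARC record: " + e.getMessage());
                return null;
            }
        }
    }

In the reducer, the caller would check for null, bump a "skipped records" counter, and continue
with the next record instead of failing the job.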



