nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <>
Subject [jira] [Resolved] (NUTCH-2666) Increase default value for http.content.limit / ftp.content.limit / file.content.limit
Date Wed, 10 Apr 2019 11:41:00 GMT


Sebastian Nagel resolved NUTCH-2666.
    Resolution: Implemented

Merged into master, will be available in 1.16. Thanks, [~mebbinghaus]!

> Increase default value for http.content.limit / ftp.content.limit / file.content.limit
> --------------------------------------------------------------------------------------
>                 Key: NUTCH-2666
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.15
>            Reporter: Marco Ebbinghaus
>            Assignee: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.16
> The default value for http.content.limit in nutch-default.xml ("The length limit for
> downloaded content using the http:// protocol, in bytes. If this value is nonnegative
> (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do
> not confuse this setting with the file.content.limit setting.") is set to 64 kB. This
> default should perhaps be increased, as many pages today are larger than 64 kB.
> This hit me when trying to crawl a single website whose pages are much larger than
> 64 kB: with every crawl cycle, the count of db_unfetched URLs decreased until it hit
> zero and the crawler became inactive, because the first 64 kB of every page always
> contained the same set of navigation links.
> The description might also be updated, as this applies not only to the http protocol
> but also to https.
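
Until the new default ships in 1.16, the limit can be raised per deployment by overriding the property in conf/nutch-site.xml. A minimal sketch; the 1 MB value below is illustrative, not the value chosen for the new default:

```xml
<!-- conf/nutch-site.xml: overrides the default from nutch-default.xml.
     The value (1 MB) is an example, not the project's new default. -->
<property>
  <name>http.content.limit</name>
  <value>1048576</value>
  <description>The length limit for downloaded content using the http/https
  protocols, in bytes. Content longer than this is truncated; a negative
  value disables truncation entirely.</description>
</property>
```

A negative value avoids truncation altogether, at the cost of unbounded fetch sizes, so a finite limit is usually the safer choice for broad crawls.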

This message was sent by Atlassian JIRA
