nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1317) Max content length by MIME-type
Date Wed, 25 Sep 2013 20:08:05 GMT


Sebastian Nagel commented on NUTCH-1317:

Thanks, [~cguzel]! The patch looks plausible (not tested yet). Two notes:
* Instead of overriding the limit for each MIME type with its own property (e.g. {{}}),
a simple map file similar to {{adaptive-mimetypes.txt}} seems more extensible (and more efficient,
because no String operations are needed to look up the max. length for a MIME type).
* The code that determines the max. length for a given MIME type is duplicated in the patch: it could be
"centralized" in HttpBase (plugin lib-http), which is inherited by protocol-http and protocol-httpclient.
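
To illustrate the two notes above, here is a minimal sketch of a map-file lookup kept in one place. It is not taken from the attached patches: the class name {{MimeContentLimits}}, the method names, and the two-column file format (MIME type, whitespace, max. length in bytes) are all assumptions made for this example. The caller is assumed to pass an already normalized, lower-case MIME type without parameters, so the lookup itself is a plain map access.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of Nutch or of the NUTCH-1317 patches. */
final class MimeContentLimits {
  private final Map<String, Integer> limits = new HashMap<>();
  private final int defaultLimit;

  private MimeContentLimits(int defaultLimit) {
    this.defaultLimit = defaultLimit;
  }

  /**
   * Builds the lookup table from the lines of a map file.
   * Assumed format: "mime/type<whitespace>maxBytes"; '#' starts a comment line.
   */
  static MimeContentLimits fromLines(List<String> lines, int defaultLimit) {
    MimeContentLimits result = new MimeContentLimits(defaultLimit);
    for (String raw : lines) {
      String line = raw.trim();
      if (line.isEmpty() || line.startsWith("#")) {
        continue; // skip blank lines and comments
      }
      String[] fields = line.split("\\s+");
      if (fields.length != 2) {
        continue; // skip malformed lines
      }
      try {
        // Normalize once at load time so lookups need no String operations.
        result.limits.put(fields[0].toLowerCase(Locale.ROOT),
            Integer.parseInt(fields[1]));
      } catch (NumberFormatException e) {
        // skip lines whose limit is not an integer
      }
    }
    return result;
  }

  /** Returns the limit for the MIME type, or the global default if unmapped. */
  int getMaxLength(String mimeType) {
    Integer limit = limits.get(mimeType);
    return limit != null ? limit : defaultLimit;
  }

  public static void main(String[] args) {
    // Example values only; real limits would come from the map file.
    List<String> lines = Arrays.asList(
        "# MIME type    max. content length (bytes)",
        "application/pdf 5242880",
        "text/html       1048576");
    MimeContentLimits limits = fromLines(lines, 65536);
    System.out.println(limits.getMaxLength("text/html"));
    System.out.println(limits.getMaxLength("image/png"));
  }
}
```

If something like this lived in HttpBase, both protocol plugins could call the same lookup instead of each carrying its own copy. In practice the lines would probably be read from a file in the configuration directory, for example with {{Files.readAllLines}}.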
> Max content length by MIME-type
> -------------------------------
>                 Key: NUTCH-1317
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 2.3, 1.9
>         Attachments: NUTCH-1317.patch, NUTCH-1317-v2.patch
> The good old http.content.limit directive is not sufficient in large internet crawls.
> For example, a 5 MB PDF file may be parsed without issues but a 5 MB HTML file may time out.
