nutch-dev mailing list archives

From "Tejas Patil (JIRA)" <>
Subject [jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
Date Tue, 22 Jan 2013 03:06:15 GMT


Tejas Patil updated NUTCH-1031:

    Attachment: NUTCH-1031-trunk.v2.patch

Added a patch for Nutch trunk (NUTCH-1031-trunk.v2.patch). If nobody has objections, I will
work on the corresponding patch for 2.x.
Summary of the changes done:
- Removed the RobotRules class, as CC provides a replacement: BaseRobotRules
- Moved RobotRulesParser out of the http plugin on account of NUTCH-1513, since other protocols
might share it.
- Added HttpRobotRulesParser, which is responsible for fetching the robots file over HTTP
- Changed references from the old Nutch classes to the classes from CC.
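To illustrate the semantics the new code relies on, here is a minimal, self-contained sketch of
the isAllowed(url) check that a BaseRobotRules-style object exposes. This is an assumption-laden
toy, not the crawler-commons implementation: it handles only plain "Disallow:" prefix rules and
ignores user-agent matching, Allow precedence, wildcards, and Crawl-delay, all of which the real
CC parser covers.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a BaseRobotRules-style object. NOT the crawler-commons
// implementation: it only collects plain "Disallow:" prefix rules and
// ignores user-agent sections, Allow lines, wildcards, and Crawl-delay.
public class TinyRobotRules {
    private final List<String> disallowed = new ArrayList<>();

    // Parse only the Disallow lines of a robots.txt body.
    public static TinyRobotRules parse(String robotsTxt) {
        TinyRobotRules rules = new TinyRobotRules();
        for (String line : robotsTxt.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.toLowerCase().startsWith("disallow:")) {
                String path = trimmed.substring("disallow:".length()).trim();
                if (!path.isEmpty()) {
                    rules.disallowed.add(path);
                }
            }
        }
        return rules;
    }

    // A path is allowed unless it starts with a disallowed prefix.
    public boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        TinyRobotRules rules = TinyRobotRules.parse(
            "User-agent: *\nDisallow: /private/\nDisallow: /tmp/\n");
        System.out.println(rules.isAllowed("/index.html"));     // true
        System.out.println(rules.isAllowed("/private/a.html")); // false
    }
}
```

In the patch itself this role is played by CC's rules object, with HttpRobotRulesParser doing the
HTTP fetch and handing the raw bytes to the CC parser.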
> Delegate parsing of robots.txt to crawler-commons
> -------------------------------------------------
>                 Key: NUTCH-1031
>                 URL:
>             Project: Nutch
>          Issue Type: Task
>            Reporter: Julien Nioche
>            Assignee: Tejas Patil
>            Priority: Minor
>              Labels: robots.txt
>             Fix For: 1.7
>         Attachments: CC.robots.multiple.agents.patch, CC.robots.multiple.agents.v2.patch,
NUTCH-1031-trunk.v2.patch, NUTCH-1031.v1.patch
> We're about to release the first version of Crawler-Commons [],
which contains a parser for robots.txt files. This parser should also be better than the one
we currently have in Nutch. I will delegate this functionality to CC as soon as it is available.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see:
