nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
Date Fri, 08 Mar 2013 20:26:13 GMT


Lewis John McGibbney commented on NUTCH-1031:

Hi Tejas. Sorry for taking forever to get around to this. 
* I really like the documentation within the patch. Big +1 for this
* Tests all pass flawlessly.
* I like the retention of the main() method in o.a.n.p.RobotRulesParser
I've tested this on several websites, including many directories within sites like
(check out the robots.txt)
I am +1 for this Tejas. Good work on this one, it's been a long time in coming to Nutch.
I am keen to hear from others.
> Delegate parsing of robots.txt to crawler-commons
> -------------------------------------------------
>                 Key: NUTCH-1031
>                 URL:
>             Project: Nutch
>          Issue Type: Task
>            Reporter: Julien Nioche
>            Assignee: Tejas Patil
>            Priority: Minor
>              Labels: robots.txt
>             Fix For: 1.7
>         Attachments: CC.robots.multiple.agents.patch, CC.robots.multiple.agents.v2.patch,
> NUTCH-1031-trunk.v2.patch, NUTCH-1031-trunk.v3.patch, NUTCH-1031-trunk.v4.patch, NUTCH-1031.v1.patch
> We're about to release the first version of Crawler-Commons [],
> which contains a parser for robots.txt files. This parser should also be better than the one
> we currently have in Nutch. I will delegate this functionality to CC as soon as it is available.
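For context, the behaviour being delegated can be illustrated with a small sketch. Nutch and crawler-commons are Java, and crawler-commons has its own API (`SimpleRobotRulesParser` / `BaseRobotRules`); the snippet below instead uses Python's stdlib `urllib.robotparser` purely to show the kind of per-agent rules any robots.txt parser must honour. The robots.txt content and hostnames are hypothetical.

```python
# Illustration only: shows robots.txt group-matching semantics that a
# parser such as crawler-commons implements for Nutch. Uses Python's
# stdlib urllib.robotparser; the robots.txt below is a made-up example.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: nutch
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The specific "nutch" group overrides the catch-all "*" group:
print(parser.can_fetch("nutch", "http://example.com/index.html"))  # True
print(parser.can_fetch("nutch", "http://example.com/private/a"))   # False
# Agents without their own group fall through to "*", which bans everything:
print(parser.can_fetch("other", "http://example.com/index.html"))  # False
```

The subtle part (and a good reason to share one well-tested implementation via crawler-commons) is exactly this group selection: a crawler named in its own `User-agent` group must ignore the `*` group entirely rather than merge the two.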

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:
