nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
Date Sun, 20 Jan 2013 07:36:16 GMT


Julien Nioche commented on NUTCH-1031:

bq. 1. Continue to have the legacy code for parsing robots file.
bq. 2. As an add-in, crawler-commons can be employed for the parsing. User can pick based on a config parameter with a note indicating that #2 won't work with multiple HTTP agents.

2 is overkill IMHO. The existing code works fine, and the point of moving to CC was to get
rid of some of our code, not to make it bigger with yet another configuration option.

Lewis: donating our code is a good idea, but in the case of the robots parsing it's more about
modifying the existing one in CC. I haven't had time to look at robots parsing in CC and am
not familiar with it, but it would be a good thing to improve. In the meantime let's go
for option 1. Thanks!
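
For reference, a minimal sketch of what calling the crawler-commons parser looks like, assuming the SimpleRobotRulesParser entry point in CC; the robots.txt content and agent name below are made up for illustration and this is not tied to either option above:

{code:java}
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsRulesSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical robots.txt content, as fetched by the protocol layer
        byte[] robotsTxt = ("User-agent: *\n"
                          + "Disallow: /private/\n"
                          + "Crawl-delay: 5\n").getBytes("UTF-8");

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        // The last argument is the agent name matched against User-agent lines;
        // a single name here, which is where the multiple-HTTP-agents concern comes in
        BaseRobotRules rules = parser.parseContent(
                "http://example.com/robots.txt",   // URL the rules were fetched from
                robotsTxt,
                "text/plain",
                "nutch-test-agent");               // hypothetical agent name

        System.out.println(rules.isAllowed("http://example.com/private/page.html"));
        System.out.println(rules.getCrawlDelay());
    }
}
{code}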

> Delegate parsing of robots.txt to crawler-commons
> -------------------------------------------------
>                 Key: NUTCH-1031
>                 URL:
>             Project: Nutch
>          Issue Type: Task
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>            Priority: Minor
>              Labels: robots.txt
>             Fix For: 1.7
>         Attachments: NUTCH-1031.v1.patch
> We're about to release the first version of Crawler-Commons []
> which contains a parser for robots.txt files. This parser should also be better than the one
> we currently have in Nutch. I will delegate this functionality to CC as soon as it is available.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see:
