nutch-dev mailing list archives

From "Tejas Patil (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
Date Sat, 19 Jan 2013 06:26:13 GMT


Tejas Patil commented on NUTCH-1031:

After waiting for more than a week, I think there is little chance of getting a fix / change
from crawler-commons.
I propose the following:
1. Continue to have the legacy code for parsing robots.txt files.
2. As an add-on, crawler-commons can be employed for the parsing.

The user can pick between the two via a config parameter, with a note indicating that #2 won't
work with multiple HTTP agents.
Would this be fine?
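To illustrate the single-agent limitation mentioned in #2, here is a minimal, self-contained sketch of the kind of robots.txt allow-check a parser performs (User-agent / Disallow directives only). This is not the Nutch or crawler-commons code; the class and method names are made up for illustration, and group matching is simplified to one agent name at a time.

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleRobotsCheck {

    // Returns true if 'path' is allowed for 'agent' under a minimal subset
    // of the robots.txt rules: only User-agent and Disallow directives,
    // prefix matching, no wildcards. Note the check matches a SINGLE agent
    // name, which mirrors why a config-selected crawler-commons parser
    // would not cover multiple HTTP agents.
    public static boolean isAllowed(String robotsTxt, String agent, String path) {
        List<String> disallows = new ArrayList<>();
        boolean inMatchingGroup = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            int hash = line.indexOf('#');           // strip comments
            if (hash >= 0) line = line.substring(0, hash).trim();
            if (line.isEmpty()) continue;
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // '*' matches everyone; otherwise match on the agent name
                inMatchingGroup = value.equals("*")
                        || agent.toLowerCase().contains(value.toLowerCase());
            } else if (field.equals("disallow") && inMatchingGroup) {
                if (!value.isEmpty()) disallows.add(value);
            }
        }
        // Disallow rules are simple path prefixes in this sketch
        for (String rule : disallows) {
            if (path.startsWith(rule)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\n";
        System.out.println(isAllowed(robots, "Nutch", "/private/x")); // false
        System.out.println(isAllowed(robots, "Nutch", "/public/x"));  // true
    }
}
```

A config switch between the legacy parser and a crawler-commons-backed one would simply select which implementation performs a check like the above.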
> Delegate parsing of robots.txt to crawler-commons
> -------------------------------------------------
>                 Key: NUTCH-1031
>                 URL:
>             Project: Nutch
>          Issue Type: Task
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>            Priority: Minor
>              Labels: robots.txt
>             Fix For: 1.7
>         Attachments: NUTCH-1031.v1.patch
> We're about to release the first version of Crawler-Commons [],
which contains a parser for robots.txt files. This parser should also be better than the one
we currently have in Nutch. I will delegate this functionality to CC as soon as it is available.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see:
