nutch-dev mailing list archives

From "Ken Krugler (JIRA)"
Subject [jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
Date Mon, 07 Jan 2013 17:52:13 GMT


Ken Krugler commented on NUTCH-1031:

Based on my reading of the robots.txt RFC ("The robot must obey the first record in /robots.txt
that contains a User-Agent line whose value contains the name token of the robot as a substring."),
it seems that the User-Agent value (what's in the robots.txt file) is searched for a substring
that matches the robot name token (what the caller is using).
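
To make that reading concrete, here is a minimal sketch of the matching rule as quoted. It is not crawler-commons code: the function name `first_matching_record` and the flat list of User-Agent values are hypothetical, and the case-insensitive comparison is an assumption on my part.

```python
from typing import Optional, Sequence


def first_matching_record(user_agent_values: Sequence[str],
                          robot_token: str) -> Optional[int]:
    """Return the index of the first record whose User-Agent value
    contains the robot's name token as a substring, or None.

    Hypothetical helper for illustration only.
    """
    token = robot_token.strip().lower()  # assumption: compare case-insensitively
    if not token:
        # An empty token would be a substring of every value, so reject it.
        return None
    for index, value in enumerate(user_agent_values):
        # The value from robots.txt is searched for the caller's token.
        if token in value.lower():
            return index
    return None
```

Note the direction of the test: the caller's token is looked for inside the value from the file, not the other way around.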

So that means in CC we'd need to either assume that a robot name _never_ contains a comma
(and split the caller-provided name on commas) or add a new API that takes a list of robot
names. Thoughts?
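
A rough sketch of the two options, purely to frame the discussion. Neither function exists in crawler-commons; `split_robot_names` and `matches_any` are hypothetical names, and the case-insensitive substring test is an assumption.

```python
from typing import List, Sequence


def split_robot_names(robot_names: str) -> List[str]:
    """Option 1 (hypothetical): split a caller-provided string on commas.

    Only safe if no robot name ever contains a comma.
    """
    return [name.strip().lower() for name in robot_names.split(",") if name.strip()]


def matches_any(user_agent_value: str, robot_names: Sequence[str]) -> bool:
    """Option 2 (hypothetical): the caller passes an explicit list of names.

    True if any non-empty name is a substring of the User-Agent value.
    """
    value = user_agent_value.lower()
    return any(name.lower() in value for name in robot_names if name)
```

With the first option, a name such as "acme, inc bot" would be split into "acme" and "inc bot", which is the failure case the comma assumption rules out. The second option avoids that ambiguity at the cost of a new method signature.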
> Delegate parsing of robots.txt to crawler-commons
> -------------------------------------------------
>                 Key: NUTCH-1031
>                 URL:
>             Project: Nutch
>          Issue Type: Task
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>            Priority: Minor
>              Labels: robots.txt
>             Fix For: 1.7
>         Attachments: NUTCH-1031.v1.patch
> We're about to release the first version of Crawler-Commons,
> which contains a parser for robots.txt files. This parser should also be better than the one
> we currently have in Nutch. I will delegate this functionality to CC as soon as it is available.
