nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
Date Mon, 07 Jan 2013 15:24:12 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545958#comment-13545958 ]

Julien Nioche commented on NUTCH-1031:
--------------------------------------

Well, we have two separate params: http.agent.name, which is a single value sent to the servers
when fetching, and http.robots.agents, which can have multiple values and is used for parsing
robots.txt. The value of http.robots.agents SHOULD be split on commas.
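
As a rough illustration of the comma splitting, here is a minimal sketch. The class and
method names are made up for the example; this is not the actual Nutch or crawler-commons
code, and trimming whitespace and dropping empty entries are assumptions about what
"split based on commas" should mean.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Nutch or crawler-commons.
final class RobotAgentNames {

    // Splits a comma-separated http.robots.agents value into individual agent names.
    // Assumption: surrounding whitespace is trimmed and empty entries are skipped.
    static List<String> split(String rawValue) {
        List<String> names = new ArrayList<>();
        if (rawValue == null) {
            return names;
        }
        for (String part : rawValue.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }
}
```

With this sketch, a value such as "mybot, otherbot ,,*" would yield three names (mybot,
otherbot and *), which could then be handed to whatever does the robots.txt matching.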

I don't think CC supports multiple values for http.robots.agents, but I'll ask Ken to be sure.
                
> Delegate parsing of robots.txt to crawler-commons
> -------------------------------------------------
>
>                 Key: NUTCH-1031
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1031
>             Project: Nutch
>          Issue Type: Task
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>            Priority: Minor
>              Labels: robots.txt
>             Fix For: 1.7
>
>         Attachments: NUTCH-1031.v1.patch
>
>
> We're about to release the first version of Crawler-Commons [http://code.google.com/p/crawler-commons/],
> which contains a parser for robots.txt files. This parser should also be better than the one
> we currently have in Nutch. I will delegate this functionality to CC as soon as it is available
> publicly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira