nutch-dev mailing list archives

From "Tejas Patil (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
Date Sun, 20 Jan 2013 10:10:14 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil updated NUTCH-1031:
-------------------------------

    Attachment: CC.robots.multiple.agents.patch

I looked at the source code of CC to understand how it works and have identified the change
to be made to CC so that it supports multiple user agents. While testing it, I found that
there is a semantic difference in the way CC works as compared to the legacy Nutch parser.

*What CC does:*
It splits _http.robots.agents_ over comma (the change that I made locally).
It scans the robots file line by line, each time checking whether the current "User-Agent"
from the file matches any one of the agents from _http.robots.agents_. If a match is found,
it takes all the corresponding rules for that agent and stops further parsing.

{noformat}robots file
User-Agent: Agent1 #foo
Disallow: /a

User-Agent: Agent2 Agent3
Disallow: /d
------------------------------------
http.robots.agents: "Agent2,Agent1"
------------------------------------
Path: "/a"{noformat}

For the example above, as soon as the first line of the robots file is scanned, a match for
"Agent1" is found. The parser takes all the corresponding rules for that agent and stores only
this information:
{noformat}User-Agent: Agent1
Disallow: /a{noformat}

Everything else is ignored.
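
For illustration, here is a minimal, self-contained sketch of how the patched CC could be
driven. The class name CCMultiAgentDemo is made up, and I am assuming the usual
parseContent(url, content, contentType, robotNames) entry point, with the fourth argument
carrying the comma-separated _http.robots.agents_ value that my local change splits:
{noformat}import java.nio.charset.StandardCharsets;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class CCMultiAgentDemo {
    public static void main(String[] args) {
        byte[] robotsTxt = ("User-Agent: Agent1 #foo\nDisallow: /a\n\n"
                + "User-Agent: Agent2 Agent3\nDisallow: /d\n")
                .getBytes(StandardCharsets.UTF_8);
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        // Fourth argument: the http.robots.agents value; the patched CC
        // splits it over comma into individual agent names.
        BaseRobotRules rules = parser.parseContent(
                "http://example.com/robots.txt", robotsTxt, "text/plain",
                "Agent2,Agent1");
        // First-match semantics: Agent1 matches first in the FILE, so only
        // its block is kept -- /a is disallowed, /d stays allowed.
        System.out.println(rules.isAllowed("http://example.com/a")); // false
        System.out.println(rules.isAllowed("http://example.com/d")); // true
    }
}{noformat}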

*What the Nutch robots parser does:*
It splits _http.robots.agents_ over comma. It scans ALL the lines of the robots file and
evaluates the matches in terms of the precedence of the user agents.
For the example above, the rules corresponding to both Agent2 and Agent1 have a match in the
robots file, but since Agent2 comes first in _http.robots.agents_, it is given priority and
the rules stored will be:
{noformat}User-Agent: Agent2
Disallow: /d{noformat}
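
To make the difference concrete, here is a small standalone sketch of the precedence idea.
This is not the actual Nutch RobotRulesParser code; NutchPrecedenceSketch and rulesFor are
made-up names:
{noformat}import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NutchPrecedenceSketch {
    // Parse ALL the User-Agent blocks first, then honour the order of agents
    // in http.robots.agents: the first configured agent with a block wins.
    static List<String> rulesFor(String robotsTxt, String httpRobotsAgents) {
        Map<String, List<String>> blocks = new HashMap<String, List<String>>();
        List<String> current = null;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.replaceAll("#.*", "").trim(); // strip comments
            String lower = line.toLowerCase();
            if (lower.startsWith("user-agent:")) {
                current = new ArrayList<String>();
                // one line may name several agents: "User-Agent: Agent2 Agent3"
                for (String name : line.substring(11).trim().split("\\s+")) {
                    blocks.put(name.toLowerCase(), current);
                }
            } else if (lower.startsWith("disallow:") && current != null) {
                current.add(line);
            }
        }
        for (String agent : httpRobotsAgents.split(",")) {
            List<String> rules = blocks.get(agent.trim().toLowerCase());
            if (rules != null) {
                return rules; // precedence: config order, not file order
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        String robots = "User-Agent: Agent1 #foo\nDisallow: /a\n\n"
                + "User-Agent: Agent2 Agent3\nDisallow: /d\n";
        // Prints [Disallow: /d]: Agent2 outranks Agent1 in http.robots.agents.
        System.out.println(rulesFor(robots, "Agent2,Agent1"));
    }
}{noformat}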

If we want to leave behind the precedence-based behaviour and adopt the CC model, I have
a small patch for crawler-commons (CC.robots.multiple.agents.patch).
                
> Delegate parsing of robots.txt to crawler-commons
> -------------------------------------------------
>
>                 Key: NUTCH-1031
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1031
>             Project: Nutch
>          Issue Type: Task
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>            Priority: Minor
>              Labels: robots.txt
>             Fix For: 1.7
>
>         Attachments: CC.robots.multiple.agents.patch, NUTCH-1031.v1.patch
>
>
> We're about to release the first version of Crawler-Commons [http://code.google.com/p/crawler-commons/]
> which contains a parser for robots.txt files. This parser should also be better than the one
> we currently have in Nutch. I will delegate this functionality to CC as soon as it is available
> publicly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
