nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1418) error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
Date Mon, 02 Jul 2012 18:50:59 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405235#comment-13405235 ]

Markus Jelsma commented on NUTCH-1418:
--------------------------------------

Indeed, there is no problem crawling Wikipedia. The warning itself is harmless, and the undecoded
path is still added to the rule set. Perhaps the path should be skipped instead: if it cannot be
decoded, there is no point in storing it in the rule set, is there?
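A rough sketch of what skipping could look like. This is not the actual RobotRulesParser code; the
class and method names below are made up for illustration. The idea is simply to attempt to
percent-decode each path and drop the rule when decoding throws, rather than storing the raw path:

    import java.io.UnsupportedEncodingException;
    import java.net.URLDecoder;
    import java.util.ArrayList;
    import java.util.List;

    public class RobotsPathFilter {

      /** Returns only the paths that percent-decode cleanly; undecodable ones are skipped. */
      public static List<String> decodableOnly(List<String> rawPaths) {
        List<String> accepted = new ArrayList<String>();
        for (String raw : rawPaths) {
          try {
            // URLDecoder throws IllegalArgumentException on malformed escapes such as "%3M".
            URLDecoder.decode(raw, "UTF-8");
            accepted.add(raw);
          } catch (IllegalArgumentException e) {
            // Same spirit as the existing warning, but the rule is dropped
            // instead of being stored undecoded.
            System.err.println("can't decode path, skipping: " + raw);
          } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always supported
          }
        }
        return accepted;
      }
    }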


                
> error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
> ------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1418
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1418
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Arijit Mukherjee
>
> Since learning that Nutch will be unable to crawl the javascript function calls in href, I started looking for other alternatives. I decided to crawl http://en.wikipedia.org/wiki/Districts_of_India.
>     I first tried injecting this URL and following the step-by-step approach up to the fetcher, when I realized Nutch did not fetch anything from this website. I looked into logs/hadoop.log and found the following three lines, which I believe could mean that Nutch is unable to parse the robots.txt on the website and therefore the fetcher stopped?
>    
>     2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
>     2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
>     2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
>     I tried checking the URL using parsechecker and found no issues there. I think this means the robots.txt for this website is malformed, which is preventing the fetcher from fetching anything. Is there a way to get around this problem, since parsechecker seems to go on its merry way parsing?
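For context, here is a minimal standalone check (not taken from the Nutch sources; the class name
is arbitrary) that reproduces why these paths trip the warning: "%3M" is not a valid percent
escape, since 'M' is not a hex digit, so java.net.URLDecoder refuses to decode the path. A literal
colon, as in "Wikipedia:Mediation_Committee", would normally be escaped as "%3A".

    import java.net.URLDecoder;

    public class DecodeDemo {
      public static void main(String[] args) throws Exception {
        // "%3M" is not a valid percent escape: 'M' is not a hex digit.
        String path = "/wiki/Wikipedia%3Mediation_Committee/";
        try {
          System.out.println(URLDecoder.decode(path, "UTF-8"));
        } catch (IllegalArgumentException e) {
          // e.g. "URLDecoder: Illegal hex characters in escape (%) pattern ..."
          System.out.println("decode failed: " + e.getMessage());
        }
      }
    }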

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
