nutch-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1418) error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
Date Mon, 02 Jul 2012 18:35:21 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405223#comment-13405223 ]

Ken Krugler commented on NUTCH-1418:
------------------------------------

The path is invalid, so Nutch emitting a warning is fine.

If Nutch subsequently bailed on processing URLs for such a web site, that would be a problem - but I don't think that's the case here: it's being logged as a warning, not an error, and Nutch obviously keeps processing the file (since you get three such warnings).

Are you _sure_ that this robots.txt issue is the reason Nutch isn't fetching?
I'm pretty sure many people use Nutch to crawl Wikipedia.
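For what it's worth, the warning is consistent with how strict percent-decoding behaves: in a percent-escape, `%` must be followed by exactly two hex digits, so `%3M` is malformed (`M` is not a hex digit), while the properly escaped colon `%3A` decodes fine. A minimal Java sketch using the JDK's `URLDecoder` (the class and method names below are illustrative; whether Nutch's RobotRulesParser decodes paths the same way internally is an assumption):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class RobotsPathDecode {

    /** Decode a robots.txt path; return null if it contains a malformed escape. */
    static String tryDecode(String path) {
        try {
            return URLDecoder.decode(path, "UTF-8");
        } catch (IllegalArgumentException e) {
            // Thrown for escapes like "%3M", where '%' is not
            // followed by two hex digits.
            return null;
        } catch (UnsupportedEncodingException e) {
            throw new AssertionError(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // Well-formed escape: %3A is a colon, decodes cleanly.
        System.out.println(tryDecode("/wiki/Wikipedia%3AMediation_Committee/"));
        // The path from the log: "%3M" is rejected by the decoder.
        System.out.println(tryDecode("/wiki/Wikipedia%3Mediation_Committee/"));
    }
}
```

So the warning itself just says those three paths could not be decoded; it doesn't by itself imply the fetcher gave up on the site.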
                
> error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
> ------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1418
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1418
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Arijit Mukherjee
>
> Since learning that Nutch is unable to crawl JavaScript function calls in href attributes, I started looking for alternatives. I decided to crawl http://en.wikipedia.org/wiki/Districts_of_India.
>     I first tried injecting this URL and following the step-by-step approach up to the fetcher, when I realized Nutch did not fetch anything from this website. I looked into logs/hadoop.log and found the following 3 lines, which I believe say that Nutch is unable to parse the website's robots.txt and that the fetcher therefore stopped:
>    
>     2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
>     2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
>     2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
>     I tried checking the URL using parsechecker, and there were no issues there! I think this means that the robots.txt for this website is malformed, which is preventing the fetcher from fetching anything. Is there a way to get around this problem? parsechecker seems to go on its merry way parsing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

