nutch-dev mailing list archives

From "Emmanuel Joke (JIRA)" <j...@apache.org>
Subject [jira] Updated: (NUTCH-578) URL fetched with 403 is generated over and over again
Date Sun, 24 Feb 2008 14:55:14 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Emmanuel Joke updated NUTCH-578:
--------------------------------

    Attachment: NUTCH-578.patch

I get the same error for pages that return HTTP status code 503.

I found the issue in the CrawlDbReduce class: the fetch time was not refreshed correctly according
to the DB status.
My patch fixes this issue.
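
To illustrate the kind of bug described above, here is a minimal sketch. It is not the contents of NUTCH-578.patch and not Nutch's actual code. The class CrawlEntrySketch, its fields, the one-day back-off, and the rule of giving up once the retry count reaches the maximum are all assumptions made for this example. The point it shows: if a recoverable failure (403, 503) does not move the fetch time forward, the entry stays due and the generator selects it again in every segment.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified model of a crawl-db entry; NOT Nutch's real classes.
final class CrawlEntrySketch {
    enum Status { UNFETCHED, GONE }

    // Mirrors the intent of db.fetch.retry.max (3 in the quoted configuration).
    static final int RETRY_MAX = 3;
    // Assumed back-off of one day per retry; the real interval is not stated in this thread.
    static final long RETRY_DELAY_MS = TimeUnit.DAYS.toMillis(1);

    Status status = Status.UNFETCHED;
    long fetchTimeMs; // earliest time the generator may select this URL again
    int retries;      // recoverable failures seen so far

    CrawlEntrySketch(long fetchTimeMs) {
        this.fetchTimeMs = fetchTimeMs;
    }

    /** Record a recoverable fetch failure (e.g. HTTP 403 or 503) observed at nowMs. */
    void onRecoverableFailure(long nowMs) {
        retries++;
        if (retries >= RETRY_MAX) {
            // Give up: the entry is never generated again.
            status = Status.GONE;
        } else {
            // The important step: refresh the fetch time so the entry is not
            // immediately due again in the next generate cycle.
            status = Status.UNFETCHED;
            fetchTimeMs = nowMs + retries * RETRY_DELAY_MS;
        }
    }

    /** True if a generator running at nowMs would select this entry. */
    boolean isDue(long nowMs) {
        return status == Status.UNFETCHED && fetchTimeMs <= nowMs;
    }
}
```

In this sketch, leaving out the fetchTimeMs assignment would keep isDue() returning true after every failure, which matches the symptom reported in the issue.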

> URL fetched with 403 is generated over and over again
> -----------------------------------------------------
>
>                 Key: NUTCH-578
>                 URL: https://issues.apache.org/jira/browse/NUTCH-578
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 1.0.0
>         Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I have checked out the most recent version of the trunk as of Nov 20, 2007
>            Reporter: Nathaniel Powell
>             Fix For: 1.0.0
>
>         Attachments: crawl-urlfilter.txt, NUTCH-578.patch, nutch-site.xml, regex-normalize.xml, urls.txt
>
>
> I have not changed the following parameter in the nutch-default.xml:
> <property>
>   <name>db.fetch.retry.max</name>
>   <value>3</value>
>   <description>The maximum number of times a url that has encountered
>   recoverable errors is generated for fetch.</description>
> </property>
> However, there is a URL on the site that I'm crawling, www.teachertube.com, which keeps being generated over and over again for almost every segment (many more times than 3):
> fetch of http://www.teachertube.com/images/ failed with: Http code=403, url=http://www.teachertube.com/images/
> This is a bug, right?
> Thanks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

