nutch-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time
Date Mon, 02 Oct 2006 20:26:29 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439304 ] 
            
Ken Krugler commented on NUTCH-353:
-----------------------------------

+1 that the redirect target is not always the "real" URL that we want to keep.

For example, http://www.ibm.com/developerworks/lotus/downloads/toolkits.html => http://www-128.ibm.com/developerworks/lotus/downloads/toolkits.html.
This holds true for most (all?) developerWorks pages; they redirect to www-128.ibm.com/<whatever>,
but IBM would love for the URL everybody sees to still be www.ibm.com/<whatever>.
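
To make the point concrete, here is a rough sketch of one possible policy for choosing which URL to keep. It is not Nutch code: the function name representative_url and the "same path, different host" heuristic are assumptions made up for illustration only.

```python
from urllib.parse import urlsplit


def representative_url(requested, target):
    """Pick the URL to keep for a redirected page (hypothetical heuristic).

    If the redirect only moves the page to another host while the path and
    query stay the same, keep the URL that was originally requested.
    Otherwise keep the redirect target.
    """
    r, t = urlsplit(requested), urlsplit(target)
    same_page = (r.path, r.query) == (t.path, t.query)
    if same_page and r.hostname != t.hostname:
        return requested
    return target
```

With the developerWorks example above, this would keep the www.ibm.com URL, since only the host changes. A real crawler would likely need something more careful than this.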

> pages that serverside forwards will be refetched every time
> -----------------------------------------------------------
>
>                 Key: NUTCH-353
>                 URL: http://issues.apache.org/jira/browse/NUTCH-353
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8.1, 0.9.0
>            Reporter: Stefan Groschupf
>         Assigned To: Andrzej Bialecki 
>            Priority: Blocker
>             Fix For: 0.9.0
>
>         Attachments: doNotRefecthForwarderPagesV1.patch
>
>
> Pages that do a server-side forward are not written back into the crawlDb with a status
> change, and their nextFetchTime is not updated.
> This causes the same page to be refetched again and again. As a result, Nutch is not polite:
> it refetches both the forwarding page and the target page in each segment iteration. It also
> affects scoring, since the forwarding page contributes its score to all outlinks.
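
The behaviour the issue asks for can be sketched with a toy crawl database. This is only an illustration under stated assumptions: the Entry class, the status strings, and the record_fetch and due_urls helpers are hypothetical and are not Nutch's actual API or the attached patch.

```python
from dataclasses import dataclass

# Hypothetical status values; not Nutch's real status constants.
UNFETCHED = "unfetched"
FETCHED = "fetched"
REDIRECTED = "redirected"


@dataclass
class Entry:
    status: str = UNFETCHED
    next_fetch_time: int = 0  # a URL is due when next_fetch_time <= now


def due_urls(crawl_db, now):
    """Return the URLs that would be selected for the next segment."""
    return sorted(url for url, e in crawl_db.items() if e.next_fetch_time <= now)


def record_fetch(crawl_db, url, now, interval, redirect_to=None):
    """Write the fetch result back, including for pages that forward.

    The forwarding page gets a status change and a new next_fetch_time,
    so it is not selected again in every iteration. The redirect target
    is added as a separate entry if it is not already known.
    """
    entry = crawl_db.setdefault(url, Entry())
    entry.status = REDIRECTED if redirect_to is not None else FETCHED
    entry.next_fetch_time = now + interval
    if redirect_to is not None:
        crawl_db.setdefault(redirect_to, Entry())
```

In this sketch, once a forwarding page has been recorded, due_urls no longer returns it until its interval has elapsed. The bug described above corresponds to skipping the status and next_fetch_time update for redirects.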

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
