nutch-dev mailing list archives

From "Chris Schneider (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-273) When a page is redirected, the original url is NOT updated.
Date Thu, 24 Aug 2006 06:49:16 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-273?page=comments#action_12430117 ] 
            
Chris Schneider commented on NUTCH-273:
---------------------------------------

Another reason why it would be better to wait until the next segment to process the target
of the redirect is that this target may already have been fetched. In this case, there's no
need to refetch it. More importantly, though, refetching the page will cause its OPIC score
to be distributed a second time to its outlinks. In fact, each page that redirects to the
target page will cause the target page's OPIC score to get redistributed.
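
To make the double-counting concrete, here is a minimal toy sketch. It is not Nutch code: the URLs, the REDIRECTS and OUTLINKS tables, and the crawl() helper are all hypothetical, and "distributing the score" is reduced to a counter. It assumes only what is described above, namely that every fetch of a page pushes its score to its outlinks.

```python
from collections import defaultdict

# Toy link graph (hypothetical URLs): "a" and "b" both redirect to "t".
REDIRECTS = {"a": "t", "b": "t"}
OUTLINKS = {"t": ["x", "y"]}


def crawl(fetch_list, follow_immediately):
    """Process one segment and count score distributions.

    Returns (distributions, next_segment): how many times each page handed
    its score to its outlinks, and the redirect targets that still need to
    be fetched in a later segment.
    """
    distributions = defaultdict(int)
    fetched = set()
    deferred = []

    def fetch(url):
        fetched.add(url)
        if url in OUTLINKS:
            # Assumption: the score is pushed to outlinks on every fetch.
            distributions[url] += 1

    for url in fetch_list:
        target = REDIRECTS.get(url)
        if target is None:
            fetch(url)
        elif follow_immediately:
            fetched.add(url)
            fetch(target)  # refetches the target even if already fetched
        else:
            fetched.add(url)
            deferred.append(target)  # leave the target for a later segment

    # A deferred target only needs fetching if nothing has fetched it yet.
    next_segment = sorted(set(deferred) - fetched)
    return dict(distributions), next_segment


print(crawl(["t", "a", "b"], follow_immediately=True))   # ({'t': 3}, [])
print(crawl(["t", "a", "b"], follow_immediately=False))  # ({'t': 1}, [])
```

With immediate redirects the target's score goes out three times (once for its own fetch and once per redirecting page). With deferral it goes out once, and the target is not queued again because it has already been fetched.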

I honestly can't see a good reason for doing an immediate redirect, since hopefully these
cases aren't common enough to make a significant difference to crawling performance.

Note that several other issues are related to this one, so any fix should take care to
satisfy the goals of all of them. In particular, I agree that we should be saving more
information in the metadata about the redirection (as well as other protocol cases).

> When a page is redirected, the original url is NOT updated.
> -----------------------------------------------------------
>
>                 Key: NUTCH-273
>                 URL: http://issues.apache.org/jira/browse/NUTCH-273
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8
>         Environment: n/a
>            Reporter: Lukas Vlcek
>
> [Excerpt from maillist, sender: Andrzej Bialecki]
> When a page is redirected, the original url is NOT updated - so, CrawlDB will never know
> that a redirect occurred, it won't even know that a fetch occurred... This looks like a bug.
> In 0.7 this was recorded in the segment, and then it would affect the Page status during
> updatedb. It should do so in 0.8, too...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
