nutch-dev mailing list archives

From Doğacan Güney (JIRA) <j...@apache.org>
Subject [jira] Commented: (NUTCH-572) Scoring and redirected Urls
Date Wed, 07 Nov 2007 20:26:50 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540880 ]

Doğacan Güney commented on NUTCH-572:
-------------------------------------

I agree. Transferring OPIC scores is a good improvement, but it is not enough.

The problem becomes incredibly complex when you consider recrawling. We have to make sure
that cnn.com and cnn.com/refresh have the same fetch interval. Otherwise we will fetch
cnn.com/refresh more often than we fetch cnn.com (assuming we use AdaptiveFetchSchedule), so
score will build up in cnn.com but will not transfer to cnn.com/refresh (since we don't fetch
cnn.com). There are other problems (for example, what to do if the original site starts to
redirect to another site), but I think this is one of the more prominent ones.
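
To make the recrawl problem concrete, here is a toy Python model (not Nutch code). It
assumes, as in the reasoning above, that score only moves from the redirecting URL to its
target when the redirecting URL is itself fetched. The function name simulate, its
parameters, and the constant daily score are all hypothetical.

```python
def simulate(days, source_interval, inlink_score_per_day=1.0):
    """Toy model of score transfer across a redirect.

    Assumption: score accrues on the redirecting URL every day and moves to
    the redirect target only on days when the redirecting URL is fetched.
    """
    pending = 0.0      # score still sitting on the redirecting URL
    transferred = 0.0  # score that has reached the redirect target
    for day in range(1, days + 1):
        pending += inlink_score_per_day
        if day % source_interval == 0:  # redirecting URL is fetched today
            transferred += pending
            pending = 0.0
    return pending, transferred


# Redirecting URL fetched every 30 days: a third of the score is stuck.
print(simulate(45, 30))  # (15.0, 30.0)
# Redirecting URL fetched daily: nothing is left behind.
print(simulate(45, 1))   # (0.0, 45.0)
```

The longer the redirecting URL goes without a fetch, the more score stays on a page that
never gets indexed, which is why the two fetch intervals would need to be kept in step.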

> Scoring and redirected Urls
> ---------------------------
>
>                 Key: NUTCH-572
>                 URL: https://issues.apache.org/jira/browse/NUTCH-572
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8, 0.8.1, 0.9.0
>         Environment: All
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>
> When a redirect is found for a given url, the new or end url is stored as the content
> page and the old CrawlDatum gets one of a few redirect codes.  The page that gets indexed in
> Nutch is the end page and it gets indexed under the end url.  Many times a site will have
> a significant number of links pointing to the start page and very few pointing to the redirected
> end page.  This is especially true for external links.  OPIC scores do not get transferred
> to the end page but stay with the start page (the one doing the redirecting).  But the start
> page doesn't get indexed.  Hence the end page will show up in the index but with a usually
> much reduced score.  A good example of this is cnn.com:
> URL: http://www.cnn.com/
> Version: 6
> Status: 5 (db_redir_perm)
> Fetch time: Tue Dec 04 11:02:09 CST 2007
> Modified time: Wed Dec 31 18:00:00 CST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 51.19438
> Signature: b5baaf80e9e10aa6205fc39051c362ff
> Metadata: _pst_:success(1), lastModified=0
> which redirects to http://www.cnn.com/?refresh=1
> URL: http://www.cnn.com/?refresh=1
> Version: 6
> Status: 2 (db_fetched)
> Fetch time: Tue Dec 04 11:02:11 CST 2007
> Modified time: Wed Dec 31 18:00:00 CST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: b5baaf80e9e10aa6205fc39051c362ff
> Metadata: _pst_:success(1), lastModified=0
> Now cnn.com, which should be one of the highest-ranking sites in the index, if not the
> highest, for keywords such as news, in fact doesn't show up in the index, and its redirected
> end page appears much farther down in search results.  My proposal is that we somehow make
> OPIC scores follow redirects.  To do this we would most likely need to store a start and end
> url for redirected urls.
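
A minimal Python sketch of what "scores follow redirects" could look like once start and
end urls are stored. This is an illustration only, not Nutch code: the two dicts stand in
for the crawl db and the stored redirect pairs, the function names are hypothetical, and
adding the start page's score to the end page (rather than, say, replacing it) is an
assumption.

```python
def resolve(url, redirects):
    """Follow stored start -> end pairs to the final url, or None on a loop."""
    seen = {url}
    while url in redirects:
        url = redirects[url]
        if url in seen:
            return None  # redirect loop: there is no final page
        seen.add(url)
    return url


def transfer_redirect_scores(scores, redirects):
    """Return a copy of scores with each redirecting url's score moved to
    the final url of its redirect chain (assumed rule: add, then zero)."""
    result = dict(scores)
    for start in redirects:
        end = resolve(start, redirects)
        if end is None:
            continue  # leave scores untouched for looping redirects
        result[end] = result.get(end, 0.0) + scores.get(start, 0.0)
        result[start] = 0.0
    return result


# Values taken from the cnn.com example in the issue description.
scores = {"http://www.cnn.com/": 51.19438, "http://www.cnn.com/?refresh=1": 1.0}
redirects = {"http://www.cnn.com/": "http://www.cnn.com/?refresh=1"}
print(transfer_redirect_scores(scores, redirects))  # end page now scores roughly 52.19
```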

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

