nutch-dev mailing list archives

From Doğacan Güney (JIRA) <j...@apache.org>
Subject [jira] Created: (NUTCH-547) Redirection handling: YahooSlurp's algorithm
Date Mon, 03 Sep 2007 07:47:18 GMT
Redirection handling: YahooSlurp's algorithm
--------------------------------------------

                 Key: NUTCH-547
                 URL: https://issues.apache.org/jira/browse/NUTCH-547
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
            Reporter: Doğacan Güney
             Fix For: 1.0.0


After reading Yahoo's algorithm (the one Andrzej linked to:
http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
in the redirect/alias handling discussion, I had a bit of spare
time, so I implemented it.

Note that the patch I am attaching is for the 'choosing' algorithm described in
Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362
for the discussion about alias handling).

E.g.,
generate "http://www.milliyet.com.tr/"

fetch "http://www.milliyet.com.tr/" which redirects to
"http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".

Update the metadata of the second page's datum to indicate that
"http://www.milliyet.com.tr/" is the representative form.

Updatedb, invertlinks, etc...

While indexing the second page, change its "url" field to
"http://www.milliyet.com.tr/".
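
The flow in the example above can be sketched roughly as follows. This is
only an illustration and not the attached patch: the class name, the
in-memory map standing in for per-URL datum metadata, and the metadata key
"representative.url" are all hypothetical. It also does not implement
Yahoo's choosing rules; the representative URL is simply passed in, as in
the example where the original URL is the one chosen.

```java
import java.util.HashMap;
import java.util.Map;

public class RedirectSketch {
    // Hypothetical metadata key; not taken from the patch or from Nutch.
    static final String REPR_KEY = "representative.url";

    // Per-URL metadata, standing in for the metadata of a crawl datum.
    static final Map<String, Map<String, String>> crawlDb = new HashMap<>();

    // Fetch step: remember which URL represents the redirect target.
    static void recordRedirect(String target, String representative) {
        crawlDb.computeIfAbsent(target, k -> new HashMap<>())
               .put(REPR_KEY, representative);
    }

    // Index step: return the value to store in the "url" field.
    // Falls back to the fetched URL when no representative was recorded.
    static String indexedUrl(String fetchedUrl) {
        Map<String, String> meta = crawlDb.get(fetchedUrl);
        if (meta == null) {
            return fetchedUrl;
        }
        return meta.getOrDefault(REPR_KEY, fetchedUrl);
    }

    public static void main(String[] args) {
        String source = "http://www.milliyet.com.tr/";
        String target = "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39";

        // The fetcher saw source redirect to target and chose source.
        recordRedirect(target, source);

        // At indexing time the target page is indexed under the chosen URL.
        System.out.println(indexedUrl(target)); // prints http://www.milliyet.com.tr/
    }
}
```

Pages that were never the target of a redirect have no entry under the
key, so they are indexed under their own URL unchanged.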

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

