nutch-dev mailing list archives

From Doğacan Güney (JIRA)
Subject [jira] Created: (NUTCH-547) Redirection handling: YahooSlurp's algorithm
Date Mon, 03 Sep 2007 07:47:18 GMT
Redirection handling: YahooSlurp's algorithm

                 Key: NUTCH-547
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
            Reporter: Doğacan Güney
             Fix For: 1.0.0

After reading Yahoo's algorithm (the one Andrzej linked to: )
in the redirect/alias handling discussion, I had a bit of spare
time, so I implemented it.

Note that the patch I am attaching is for the 'choosing' algorithm described in
Yahoo's help page. It makes no attempt to handle aliases in any way. (See
for the discussion about alias handling).

generate ""

fetch "http:/" which redirects to

Update second page's datum's metadata to indicate that
"" is the representative form.

Updatedb, invertlinks, etc...

While indexing second page, change its "url" field to
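
The flow above can be sketched roughly as follows. This is a minimal,
language-neutral illustration and not the attached patch: the metadata key
`repr_url`, the dict-based crawl db, the example.com URLs, and all function
names are assumptions made up for the example, and `choose_representative`
is only a placeholder, not the rule set from Yahoo's help page.

```python
# Hypothetical metadata key; the name used by the actual patch is not given here.
REPR_URL_KEY = "repr_url"


def choose_representative(src, dst):
    """Placeholder chooser: prefer the shorter URL, else the redirect target.

    This is NOT Yahoo's algorithm, only a stand-in so the flow can run.
    """
    return src if len(src) < len(dst) else dst


def record_redirect(crawl_db, src, dst):
    """After fetching src and being redirected to dst, store the
    representative form in the metadata of dst's datum."""
    datum = crawl_db.setdefault(dst, {})
    datum.setdefault("metadata", {})[REPR_URL_KEY] = choose_representative(src, dst)


def index_document(crawl_db, url):
    """Build the index document for url, replacing its "url" field with the
    representative form when one was recorded, else keeping url itself."""
    metadata = crawl_db.get(url, {}).get("metadata", {})
    return {"url": metadata.get(REPR_URL_KEY, url)}


# Assumed example URLs: the first page redirects to the second.
crawl_db = {}
record_redirect(crawl_db, "http://example.com/", "http://example.com/home/index.html")
doc = index_document(crawl_db, "http://example.com/home/index.html")
```

Keeping the choice in the second page's metadata means the later steps
(updatedb, invertlinks) can run unchanged, and only the indexing step needs
to look at the recorded value.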

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
