nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <>
Subject [jira] Commented: (NUTCH-547) Redirection handling: YahooSlurp's algorithm
Date Mon, 03 Sep 2007 18:14:57 GMT


Andrzej Bialecki  commented on NUTCH-547:

A few comments:

* the patch uses a strange diff format ... the first lines of context diffs appear on the
same lines as chunk coordinates.

* in Fetcher[2].handleRedirect(), what happens when the selected reprUrl is the same as the
urlString? We should skip the redirect then.

* the repeated parsing of refreshTime should be hidden in a utility method in ParseStatus
- although the proper way to support this would be to extend ParseStatus to store this int
value when necessary, i.e. when ParseStatus is SUCCESS_REDIRECT (we would have to bump the version
number, too).

* minimum refreshTime should be at least a constant, or configurable, and not a literal. Similarly
the redirType should be a constant.

* parsing of the redirect time should be moved IMHO to handleRedirect(), to simplify the logic
in the

* if we change the "url" field in BasicIndexingFilter, shouldn't we also change the "site" and
"host" fields? We could also consider adding reprUrl as an additional value for the same "url"
field - this way we would get hits both on the original url and the reprUrl.

* I'm not sure why the patch tries to overwrite reprUrl from fetchDatum with
the value from dbDatum - if anything, the value in fetchDatum should be more up to date, no?
As it is now, it's silently overwritten. The only way the reprUrl could end up in dbDatum
is from a previous updatedb operation, so it should contain older information.
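
To make a few of the points above concrete, here is a rough, standalone sketch. It does
not use any Nutch classes; RedirectSketch, its method names and the default value of the
constant are all hypothetical and only illustrate the suggestions (skip a redirect to the
same URL, parse refreshTime in one place, avoid literals, prefer the fetched reprUrl).

```java
import java.util.OptionalInt;

/** Hypothetical helper; none of these names come from the patch or from Nutch. */
final class RedirectSketch {

    /** Assumed default value; the point is only that it is a named constant. */
    static final int DEFAULT_MIN_REFRESH_SECONDS = 1;

    private RedirectSketch() {}

    /** Parses a refresh time once, in one place; empty if missing, malformed or negative. */
    static OptionalInt parseRefreshTime(String raw) {
        if (raw == null) {
            return OptionalInt.empty();
        }
        try {
            int seconds = Integer.parseInt(raw.trim());
            return seconds < 0 ? OptionalInt.empty() : OptionalInt.of(seconds);
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    /** A redirect whose chosen representative URL equals the current URL is skipped. */
    static boolean shouldFollow(String urlString, String reprUrl) {
        return reprUrl != null && !reprUrl.equals(urlString);
    }

    /** Prefers the freshly fetched value; falls back to the older db value only if absent. */
    static String preferFetched(String fromFetchDatum, String fromDbDatum) {
        return fromFetchDatum != null ? fromFetchDatum : fromDbDatum;
    }
}
```

In a real patch the constant would more likely be read from the configuration, and the
parsed value would be stored in ParseStatus rather than re-parsed by callers.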

> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>                 Key: NUTCH-547
>                 URL:
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>         Attachments: redirect_draft.patch
> After reading Yahoo's algorithm (the one Andrzej linked to:
> )
> in the redirect/alias handling discussion, I had a bit of spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described in
> Yahoo's help page. It makes no attempt to handle aliases in any way. (See
> for the discussion about alias handling).
> E.g.,
> generate ""
> fetch "http:/" which redirects to
> "".
> Update second page's datum's metadata to indicate that
> "" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "".

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
