nutch-dev mailing list archives

From Doğacan Güney (JIRA) <j...@apache.org>
Subject [jira] Commented: (NUTCH-547) Redirection handling: YahooSlurp's algorithm
Date Tue, 04 Sep 2007 11:42:48 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524693
] 

Doğacan Güney commented on NUTCH-547:
-------------------------------------

Thanks a lot for the quick review, Andrzej.

> * the patch uses a strange diff format ... the first lines of context diffs appear on
> the same lines as chunk coordinates.

Sorry about that. I am using git-svn (which, by the way, is an awesome tool) to develop Nutch,
so I may have forgotten to use "svn diff" for the patch.

> * in Fetcher[2].handleRedirect(), what happens when the selected reprUrl is the same
> as the urlString? We should skip the redirect then.

We don't follow reprUrl; we follow newUrl, which is tested for equality with urlString. However,
we should probably avoid writing reprUrl in crawldatum metadata if it is the same as the urlString.
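
A minimal sketch of the guard described above, assuming the crawl datum metadata can be treated as a plain string map. The class name, the method name, and the "_repr_" key are hypothetical and are not taken from the patch:

```java
import java.util.HashMap;
import java.util.Map;

public class ReprUrlMetadataSketch {
    // Hypothetical metadata key; the real patch may use a different name.
    static final String REPR_URL_KEY = "_repr_";

    // Stores reprUrl only when it carries information beyond urlString.
    static void recordReprUrl(Map<String, String> metadata, String urlString, String reprUrl) {
        if (reprUrl == null || reprUrl.equals(urlString)) {
            return; // nothing new to record
        }
        metadata.put(REPR_URL_KEY, reprUrl);
    }

    public static void main(String[] args) {
        Map<String, String> same = new HashMap<>();
        recordReprUrl(same, "http://www.milliyet.com.tr/", "http://www.milliyet.com.tr/");
        System.out.println(same.containsKey(REPR_URL_KEY));

        Map<String, String> different = new HashMap<>();
        recordReprUrl(different,
            "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39",
            "http://www.milliyet.com.tr/");
        System.out.println(different.get(REPR_URL_KEY));
    }
}
```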

> * the repeating parsing of refreshTime should be hidden in a utility method in ParseStatus
> - although the proper way to support this would be to extend ParseStatus to store this int
> value if necessary, i.e. if ParseStatus is SUCCESS_REDIRECT (we would have to bump the
> version number, too).

Good point. Will look into that.
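
As a rough illustration of the suggested utility method, the sketch below parses the refresh time in one place. It assumes the status arguments are a string array whose second element holds the refresh time in seconds; that layout, the class name, and the method name are assumptions made for this example, not the ParseStatus API:

```java
public class RefreshTimeSketch {
    // Assumed layout: args[0] = redirect target, args[1] = refresh time in seconds.
    static int refreshTime(String[] args, int defaultValue) {
        if (args == null || args.length < 2 || args[1] == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(args[1].trim());
        } catch (NumberFormatException e) {
            // Malformed value: fall back instead of failing the fetch.
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        System.out.println(refreshTime(new String[] {"http://www.milliyet.com.tr/", "5"}, 0));
        System.out.println(refreshTime(new String[] {"http://www.milliyet.com.tr/"}, 0));
    }
}
```

Callers would then use the one helper rather than repeating Integer.parseInt at every site that inspects a redirect.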

> * minimum refreshTime should be at least a constant, or configurable, and not a literal.
> Similarly the redirType should be a constant.

This patch is only a rough draft. I will fix all such issues in a later patch.

> * if we change the "url" field in BasicIndexingFilter, shouldn't we also change the "site"
> and "host" fields? [...]

Wow, can't believe I missed that. 

> [..] We could also consider adding reprUrl as an additional value for the same "url"
> field - this way we would get hits both on the original url and the reprUrl.

This may cause problems with dedup, which assumes that the "url" field has a single value. Also,
it may be difficult to decide which value of "url" to show in the web UI. I also like the fact
that "url" is like a UNIQUE KEY for the document. If we allow "url" to have multiple values,
we lose that.

Perhaps we can add reprUrl to a "repr" field instead?
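
A minimal sketch of that idea, modelling an indexed document as a map from field name to values rather than using the real indexing classes. The helper names are hypothetical; "url" keeps exactly one value and the representative URL goes into a separate "repr" field:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReprFieldSketch {
    // Builds the fields for one document; "url" always has exactly one value.
    static Map<String, List<String>> buildFields(String url, String reprUrl) {
        Map<String, List<String>> fields = new HashMap<>();
        addValue(fields, "url", url);
        if (reprUrl != null && !reprUrl.equals(url)) {
            addValue(fields, "repr", reprUrl);
        }
        return fields;
    }

    static void addValue(Map<String, List<String>> fields, String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    public static void main(String[] args) {
        Map<String, List<String>> fields = buildFields(
            "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39",
            "http://www.milliyet.com.tr/");
        System.out.println(fields.get("url"));
        System.out.println(fields.get("repr"));
    }
}
```

With this layout, dedup can keep treating "url" as single-valued, while searches could still match the representative form through "repr".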

> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>
>                 Key: NUTCH-547
>                 URL: https://issues.apache.org/jira/browse/NUTCH-547
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: redirect_draft.patch
>
>
> After reading Yahoo's algorithm (the one Andrzej linked to:
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
> in the redirect/alias handling discussion, I had a bit of spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described in
> Yahoo's help page. It makes no attempt to handle aliases in any way. (See
> http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362
> for the discussion about alias handling).
> E.g.,
> generate "http://www.milliyet.com.tr/"
> fetch "http://www.milliyet.com.tr/" which redirects to
> "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
> Update second page's datum's metadata to indicate that
> "http://www.milliyet.com.tr/" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "http://www.milliyet.com.tr/".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

