nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-547) Redirection handling: YahooSlurp's algorithm
Date Mon, 10 Sep 2007 20:25:29 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526258 ]

Andrzej Bialecki  commented on NUTCH-547:
-----------------------------------------

> > I'm not sure why the patch to Indexer.java tries to overwrite reprUrl from fetchDatum
> > with the value from dbDatum [..]

I'm still not sure about this issue - could you please clarify?

> Perhaps we can add reprUrl to a "repr" field instead?

Shouldn't this be the other way around? The idea of your patch is to put the data under the
reprUrl, so in order to minimize code changes you replace the original url with reprUrl. This
way we lose the value of the original url, so it seems to me that if we want to preserve it
we should add it to an "orig" field.
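
To make the suggestion concrete, here is a minimal sketch in plain Java of what I mean. It is not taken from the patch and does not use the actual Nutch or Lucene classes: the class name, the method, and the use of a plain Map as a stand-in for an index document are all assumptions for illustration. Only the field names "url" and "orig" come from this discussion.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a plain Map stands in for an index document.
public class ReprUrlIndexSketch {

    // Field names from the discussion; the constants themselves are hypothetical.
    static final String URL_FIELD = "url";
    static final String ORIG_FIELD = "orig";

    /**
     * Builds the URL-related fields for one page.
     * If a representative URL is known and differs from the fetched URL,
     * it is stored under "url" and the fetched URL is preserved under "orig".
     * Otherwise only "url" is set, holding the fetched URL.
     */
    static Map<String, String> buildFields(String fetchedUrl, String reprUrl) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (reprUrl != null && !reprUrl.equals(fetchedUrl)) {
            fields.put(URL_FIELD, reprUrl);
            fields.put(ORIG_FIELD, fetchedUrl);
        } else {
            fields.put(URL_FIELD, fetchedUrl);
        }
        return fields;
    }

    public static void main(String[] args) {
        // URLs taken from the example in the issue description.
        Map<String, String> fields = buildFields(
                "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39",
                "http://www.milliyet.com.tr/");
        System.out.println(fields);
    }
}
```

With this shape, code that only reads "url" keeps working unchanged, and the original url is still available to anything that needs it.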

> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>
>                 Key: NUTCH-547
>                 URL: https://issues.apache.org/jira/browse/NUTCH-547
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: redirect_draft.patch
>
>
> After reading Yahoo's algorithm (the one Andrzej linked to:
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
> in the redirect/alias handling discussion, I had a bit of spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described in
> Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362
> for the discussion about alias handling).
> E.g.,
> generate "http://www.milliyet.com.tr/"
> fetch "http://www.milliyet.com.tr/" which redirects to
> "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
> Update second page's datum's metadata to indicate that
> "http://www.milliyet.com.tr/" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "http://www.milliyet.com.tr/".
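
The flow in the example above (record the representative form in the redirect target's metadata at fetch time, then read it back at index time) can be sketched roughly as follows. This is an assumption-laden illustration, not code from redirect_draft.patch: the nested Map stands in for the crawl db, the metadata key "reprUrl" and all method names are hypothetical, and the 'choosing' step itself is not implemented, so the chosen URL is simply passed in.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a nested Map stands in for per-URL datum metadata.
public class RedirectMetadataSketch {

    // Hypothetical metadata key for the representative URL.
    static final String REPR_KEY = "reprUrl";

    /**
     * Fetch-time step: after a redirect, store the chosen representative URL
     * in the metadata of the redirect target.
     */
    static void recordRepresentative(Map<String, Map<String, String>> db,
                                     String targetUrl, String chosenReprUrl) {
        db.computeIfAbsent(targetUrl, k -> new HashMap<>())
          .put(REPR_KEY, chosenReprUrl);
    }

    /**
     * Index-time step: return the representative URL if one was recorded
     * for this page, otherwise the page's own URL.
     */
    static String indexedUrl(Map<String, Map<String, String>> db, String pageUrl) {
        Map<String, String> meta = db.get(pageUrl);
        String repr = (meta == null) ? null : meta.get(REPR_KEY);
        return (repr != null) ? repr : pageUrl;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> db = new HashMap<>();
        String source = "http://www.milliyet.com.tr/";
        String target = "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39";
        // In this example the redirect source was chosen as representative.
        recordRepresentative(db, target, source);
        System.out.println(indexedUrl(db, target));
    }
}
```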

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

