nutch-dev mailing list archives

From Doğacan Güney (JIRA) <j...@apache.org>
Subject [jira] Commented: (NUTCH-547) Redirection handling: YahooSlurp's algorithm
Date Mon, 10 Sep 2007 20:44:29 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526263
] 

Doğacan Güney commented on NUTCH-547:
-------------------------------------

>> > I'm not sure why the patch to Indexer.java tries to overwrite reprUrl from fetchDatum with the value from dbDatum [..]
>
> I'm still not sure about this issue - could you please clarify?

Sorry, it seems I forgot to answer it :) 

It is possible that we discover a meta-redirect during the parse phase. We have no way of updating
fetch datums at that point, so parse instead writes this information to crawl_parse, which
is then passed on to crawldb. So, during indexing, dbDatum may contain (meta-)redirect
information while fetchDatum does not. But you are right that we should probably give
fetchDatum's metadata priority over dbDatum's.
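
To make the intended priority concrete, here is a rough sketch (not the code from the patch). It models each datum's metadata as a plain map rather than using the real Nutch classes, and the key name "reprUrl" and the class and method names are placeholders I made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

class ReprUrlChooser {
    // Hypothetical metadata key; the key actually used by the patch may differ.
    static final String REPR_URL_KEY = "reprUrl";

    /**
     * Returns the representative URL, preferring the value recorded at fetch
     * time and falling back to the value carried in the crawldb entry (which
     * is where a meta-redirect found during parsing would end up).
     * Returns null when neither datum carries a representative URL.
     */
    static String chooseReprUrl(Map<String, String> fetchMeta, Map<String, String> dbMeta) {
        String fromFetch = (fetchMeta == null) ? null : fetchMeta.get(REPR_URL_KEY);
        if (fromFetch != null) {
            return fromFetch;
        }
        return (dbMeta == null) ? null : dbMeta.get(REPR_URL_KEY);
    }
}
```

With this ordering, dbDatum only fills the gap when fetchDatum has nothing to say, instead of overwriting it.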

> Perhaps we can add reprUrl to a "repr" field instead?
> 
> Shouldn't this be the other way around - the idea of your patch is to put the data under the reprUrl, so in order to minimize code changes you
> replace the original url with reprUrl. This way we lose the value of the original url, so it seems to me that if we want to preserve it we should add it
> to an "orig" field ..

OK, this makes sense to me. I guess we should make "orig" both indexed and stored?
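
Roughly, the field layout would look like the sketch below. It uses a plain map in place of the real index document, and the class and method names are made up for illustration. Whether "orig" ends up indexed, stored, or both is not modeled here:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RedirectIndexFields {
    /**
     * Builds the URL-related fields for one page.
     * fetchedUrl is the URL the content was actually fetched from;
     * reprUrl is the representative URL, or null if there is none.
     */
    static Map<String, String> buildUrlFields(String fetchedUrl, String reprUrl) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (reprUrl == null || reprUrl.equals(fetchedUrl)) {
            // No redirect involved: only the usual "url" field is written.
            fields.put("url", fetchedUrl);
            return fields;
        }
        // The data is put under the representative URL ...
        fields.put("url", reprUrl);
        // ... and the original URL is preserved in "orig" so it is not lost.
        fields.put("orig", fetchedUrl);
        return fields;
    }
}
```

For the example in the issue description, "url" would be "http://www.milliyet.com.tr/" and "orig" would be "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".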

---

Btw, one of the major issues with redirection (that this patch does not solve) is that scores and
other information do not carry over across redirects. Assume foo.com is a major web site. Url http://www.foo.com/
redirects to http://www.foo.com/daily.html . People, naturally, are much more likely to link
to http://www.foo.com/ than to http://www.foo.com/daily.html (the problem is even more interesting
if http://foo.com also redirects to http://www.foo.com/daily.html ). So, I think we must have
some way to "pass" the score from the source url to the redirect target. The same goes for adaptive
crawls: It may look like www.foo.com never changes (since it just redirects to a different
url). But it should be considered "modified" whenever the page at the redirect target is updated.

I am not sure how we can achieve this, though. We will probably need an extra job (that should
run at least once before indexing) that merges information from such pages.
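
As a very rough illustration of what such a merge step might compute, here is a sketch over in-memory maps. It is not a Nutch job, all names are made up, it only follows a single redirect hop, and adding the source's score to the target's is just one possible policy (an assumption, not something decided here):

```java
import java.util.HashMap;
import java.util.Map;

class RedirectMergeSketch {
    /**
     * Passes scores along redirects: the score of every redirect source is
     * added to the score of its target. Sources keep their own score.
     * redirects maps a source URL to the URL it redirects to.
     */
    static Map<String, Double> passScores(Map<String, Double> scores,
                                          Map<String, String> redirects) {
        Map<String, Double> merged = new HashMap<>(scores);
        for (Map.Entry<String, String> r : redirects.entrySet()) {
            Double sourceScore = scores.get(r.getKey());
            if (sourceScore != null) {
                merged.merge(r.getValue(), sourceScore, Double::sum);
            }
        }
        return merged;
    }

    /**
     * Treats a redirect source as modified whenever its target was modified:
     * the source's modification time becomes the later of its own time and
     * the target's time.
     */
    static Map<String, Long> propagateModified(Map<String, Long> modified,
                                               Map<String, String> redirects) {
        Map<String, Long> merged = new HashMap<>(modified);
        for (Map.Entry<String, String> r : redirects.entrySet()) {
            Long targetTime = modified.get(r.getValue());
            if (targetTime != null) {
                merged.merge(r.getKey(), targetTime, Long::max);
            }
        }
        return merged;
    }
}
```

In the foo.com example, both http://www.foo.com/ and http://foo.com would contribute their scores to http://www.foo.com/daily.html , and both would count as modified whenever daily.html is updated.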



> Redirection handling: YahooSlurp's algorithm
> --------------------------------------------
>
>                 Key: NUTCH-547
>                 URL: https://issues.apache.org/jira/browse/NUTCH-547
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>            Reporter: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: redirect_draft.patch
>
>
> After reading Yahoo's algorithm (the one Andrzej linked to:
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html )
> in the redirect/alias handling discussion, I had a bit of spare
> time, so I implemented it.
> Note that the patch I am attaching is for the 'choosing' algorithm described in
> Yahoo's help page. It makes no attempt to handle aliases in any way. (See http://www.nabble.com/Redirects-and-alias-handling-%28LONG%29-tf4270371.html#a12154362 for the discussion about alias handling).
> E.g.,
> generate "http://www.milliyet.com.tr/"
> fetch "http://www.milliyet.com.tr/" which redirects to
> "http://www.milliyet.com.tr/2007/08/29/index.html?ver=39".
> Update second page's datum's metadata to indicate that
> "http://www.milliyet.com.tr/" is the representative form.
> Updatedb, invertlinks, etc...
> While indexing second page, change its "url" field to
> "http://www.milliyet.com.tr/".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

