nutch-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <>
Subject [jira] Closed: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages
Date Thu, 28 Dec 2006 00:18:24 GMT

Andrzej Bialecki closed NUTCH-322.

    Resolution: Fixed

Fixed in trunk/, rev. 490607. NOTE: this doesn't fully solve the proper handling of
redirected pages from the point of view of scoring and LinkDB, but it does solve the original
issue described here.

> Fetcher discards ProtocolStatus, doesn't store redirected pages
> ---------------------------------------------------------------
>                 Key: NUTCH-322
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.9.0
> Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus contains important
> information, such as protocol-level response code, lastModified time, and possibly other messages.
> I propose that ProtocolStatus should be stored inside CrawlDatum.metaData, which is then
> stored into crawl_fetch (in Fetcher.FetcherThread.output()). In addition, if ProtocolStatus
> contains a valid lastModified time, that CrawlDatum's modified time should also be set to
> this value.
> Additionally, Fetcher doesn't store redirected pages. Content of such pages is silently
> discarded. When Fetcher translates from protocol-level status to crawldb-level status it should
> probably store such pages with the following translation of status codes:
> * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code indicates a transient
> change, so we probably shouldn't mark the initial URL as bad.
> * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a permanent
> change, so the initial URL is no longer valid, i.e. it will always result in redirects.
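
The proposal quoted above could be sketched roughly as follows. This is a self-contained
illustration only, not the code committed in rev. 490607 and not the Nutch API: the enums
ProtocolCode and DbStatus, the Datum class, the output() helper and the "protocol.status"
metadata key are all hypothetical stand-ins that merely borrow the names used in the issue.

```java
import java.util.HashMap;
import java.util.Map;

public class RedirectStatusSketch {

    // Hypothetical stand-ins for the protocol-level codes named in the issue.
    enum ProtocolCode { SUCCESS, MOVED, TEMP_MOVED }

    // Hypothetical stand-ins for the crawldb-level statuses named in the issue.
    enum DbStatus { STATUS_DB_FETCHED, STATUS_DB_GONE, STATUS_DB_RETRY }

    // Simplified stand-in for the record written to crawl_fetch.
    static final class Datum {
        DbStatus status;
        long modifiedTime; // 0 means "unknown"
        final Map<String, String> metaData = new HashMap<>();
    }

    // Translation proposed in the issue: a temporary redirect is retried,
    // a permanent redirect marks the initial URL as gone.
    static DbStatus toDbStatus(ProtocolCode code) {
        switch (code) {
            case TEMP_MOVED:
                return DbStatus.STATUS_DB_RETRY;
            case MOVED:
                return DbStatus.STATUS_DB_GONE;
            case SUCCESS:
                return DbStatus.STATUS_DB_FETCHED; // assumption, not stated in the issue
            default:
                throw new IllegalArgumentException("unhandled code: " + code);
        }
    }

    // Keeps the protocol status in the metadata instead of discarding it, and
    // copies lastModified into the datum only when it looks valid (> 0).
    static Datum output(ProtocolCode code, long lastModified) {
        Datum datum = new Datum();
        datum.status = toDbStatus(code);
        datum.metaData.put("protocol.status", code.name()); // hypothetical key
        if (lastModified > 0) {
            datum.modifiedTime = lastModified;
        }
        return datum;
    }

    public static void main(String[] args) {
        // Arbitrary epoch-millisecond value used as the lastModified time.
        Datum datum = output(ProtocolCode.TEMP_MOVED, 1000L);
        System.out.println(datum.status + " " + datum.metaData + " " + datum.modifiedTime);
    }
}
```

The sketch treats a lastModified value of zero or less as "not provided", which is an
assumption; the issue only says the time should be copied when it is valid.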

This message is automatically generated by JIRA.