nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Trivial Update of "CrawlDatumStates" by SebastianNagel
Date Tue, 06 Dec 2011 21:38:11 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "CrawlDatumStates" page has been changed by SebastianNagel:
http://wiki.apache.org/nutch/CrawlDatumStates?action=diff&rev1=5&rev2=6

Comment:
restored part accidentally deleted in revision #3 (2011-11-21 15:36:01)

  
  If there was a temporary problem in fetching (e.g. an exception or a timeout), this URL
is left as "unfetched" but its retry counter is incremented. If this counter reaches a limit
(default is 3), the page is marked as "gone". Pages that are "gone" are not considered for
fetching by Generator for a long time, namely the maxFetchInterval (e.g. 180 days). They are
kept because even gone pages may re-appear after a while, and because we want to avoid
re-discovering them and giving them a status of "unfetched".
  
- Other possible states after fetching are "truly gone" ;) (e.g. forbidden by robots.txt or
unauthorized), which get the same treatment as described above - that is after a long period
of time we check again their status, which ma
+ Other possible states after fetching are "truly gone" ;) (e.g. forbidden by robots.txt or
unauthorized), which get the same treatment as described above - that is after a long period
of time we check again their status, which may have changed.
  
+ In case of "success" we mark this URL as "fetched". This URL is not eligible for re-fetching
until after fetchInterval, at which point it's considered outdated and in need of re-fetching
(i.e. the same as "unfetched").
+ 
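
The state handling described in this diff can be sketched roughly as follows. This is a minimal illustration, not Nutch's actual code: the names Datum, update and is_eligible are hypothetical, the 30-day FETCH_INTERVAL is an assumed value (the text gives only the retry limit of 3 and "e.g. 180 days" for maxFetchInterval), and resetting the retry counter on success is likewise an assumption.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60  # seconds per day

MAX_RETRIES = 3                  # "default is 3" in the text
MAX_FETCH_INTERVAL = 180 * DAY   # "e.g. 180 days" in the text
FETCH_INTERVAL = 30 * DAY        # assumed value, not given in the text


@dataclass
class Datum:
    """Hypothetical per-URL record; not the real CrawlDatum class."""
    status: str = "unfetched"
    retries: int = 0
    next_fetch: float = 0.0  # earliest time the URL may be fetched again


def update(datum: Datum, outcome: str, now: float) -> None:
    """Apply one fetch outcome: "success", "temporary_failure" or "gone"."""
    if outcome == "success":
        datum.status = "fetched"
        datum.retries = 0  # assumption: a success clears the retry counter
        datum.next_fetch = now + FETCH_INTERVAL
    elif outcome == "temporary_failure":
        datum.retries += 1
        if datum.retries >= MAX_RETRIES:
            # Retry limit reached: treat the page as gone for a long time.
            datum.status = "gone"
            datum.next_fetch = now + MAX_FETCH_INTERVAL
        else:
            # Still "unfetched"; next_fetch is left unchanged in this sketch.
            datum.status = "unfetched"
    elif outcome == "gone":
        # E.g. forbidden by robots.txt or unauthorized: same long wait.
        datum.status = "gone"
        datum.next_fetch = now + MAX_FETCH_INTERVAL
    else:
        raise ValueError(f"unknown outcome: {outcome}")


def is_eligible(datum: Datum, now: float) -> bool:
    """A URL is considered for fetching once its wait period has passed."""
    return now >= datum.next_fetch
```

In this sketch a "fetched" or "gone" record becomes eligible again purely by the passage of time, which mirrors the text's point that an outdated page is treated the same as an "unfetched" one.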
