From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting unmodified content
Date Wed, 17 Jan 2007 19:34:30 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465517 ]

Andrzej Bialecki  commented on NUTCH-61:
----------------------------------------

Actually, there is a way to do this, and this patch implements it.

We define a maximum "time to live" for _any_ page, no matter when it was last fetched or what
its re-fetch interval is. This is a system-wide setting. If the re-fetch interval is longer than
this value, or the page has not been re-fetched for at least that long for some other reason
(e.g. because it was unmodified, and we don't fetch unmodified content), the page will be forcibly
included in the fetchlist candidates as if it had DB_UNFETCHED status.
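
A minimal sketch of that candidate check, for illustration only. The class name, method name
and the 90-day value are hypothetical and are not taken from the patch; the only idea carried
over from the description above is that an expired TTL overrides the page's own interval.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative only: none of these names come from the Nutch codebase. */
final class MaxTtlCheck {
    /** Assumed system-wide maximum time to live; the real value is a configuration choice. */
    static final Duration MAX_TTL = Duration.ofDays(90);

    /**
     * Returns true if the page should be offered as a fetchlist candidate.
     * lastFetch is the last time the content was actually fetched.
     */
    static boolean isFetchCandidate(Instant lastFetch, Duration interval, Instant now) {
        // Normal case: the page's own re-fetch interval has elapsed.
        boolean due = !now.isBefore(lastFetch.plus(interval));
        // Forced case: the page is older than the maximum TTL, so it is
        // treated as if it had never been fetched, whatever its interval says.
        boolean expired = !now.isBefore(lastFetch.plus(MAX_TTL));
        return due || expired;
    }
}
```

With these assumed values, a page with a 365-day interval becomes a candidate again after 90 days,
not after a year.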

This means we can be sure that any pages still present in segments older than this maximum
TTL will have been re-fetched, and we can safely discard all segments older than the TTL.
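
The pruning rule can be sketched the same way. This is a hedged illustration, not code from
the patch: the map of segment names to creation times is a hypothetical stand-in for however
segment age is actually determined, and the method only selects segments, it deletes nothing.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative only: segment bookkeeping here is a hypothetical stand-in. */
final class SegmentPruner {
    /** Returns the names of segments created strictly before (now - maxTtl). */
    static List<String> discardable(Map<String, Instant> segmentCreated,
                                    Duration maxTtl, Instant now) {
        Instant cutoff = now.minus(maxTtl);
        List<String> old = new ArrayList<>();
        for (Map.Entry<String, Instant> e : segmentCreated.entrySet()) {
            // Every page in such a segment has exceeded the TTL and so has
            // already been forced back into a fetchlist.
            if (e.getValue().isBefore(cutoff)) {
                old.add(e.getKey());
            }
        }
        return old;
    }
}
```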

> Adaptive re-fetch interval. Detecting unmodified content
> --------------------------------------------------------
>
>                 Key: NUTCH-61
>                 URL: https://issues.apache.org/jira/browse/NUTCH-61
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>         Attachments: 20050606.diff, 20051230.txt, 20060227.txt, nutch-61-417287.patch
>
>
> Currently Nutch doesn't automatically adjust its re-fetch period, no matter whether individual
> pages change seldom or frequently. The goal of these changes is to extend the current codebase
> to support various possible adjustments to re-fetch times and intervals, and specifically
> a re-fetch schedule which tries to adapt the period between consecutive fetches to the period
> of content changes.
> Also, these patches implement checking whether the content has changed since the last fetch;
> protocol plugins are also changed to make use of this information, so that if content is unmodified
> it doesn't have to be fetched and processed.
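
One possible shape for such an adaptive schedule, as a rough sketch only. The issue text does
not give a formula, so the percentages, the bounds and all names below are assumptions made
for illustration: the interval shrinks when a fetch finds changed content and grows when the
content is unmodified, clamped to a fixed range.

```java
/** Illustrative only: rates, bounds and names are assumed, not taken from the patches. */
final class AdaptiveInterval {
    static final long DEC_PERCENT = 20;          // assumed shrink when content changed
    static final long INC_PERCENT = 40;          // assumed growth when content is unmodified
    static final long MIN_SECONDS = 3_600L;      // assumed lower bound: one hour
    static final long MAX_SECONDS = 7_776_000L;  // assumed upper bound: 90 days

    /** Returns the next re-fetch interval in seconds. */
    static long next(long intervalSeconds, boolean modified) {
        long next = modified
                ? intervalSeconds - intervalSeconds * DEC_PERCENT / 100
                : intervalSeconds + intervalSeconds * INC_PERCENT / 100;
        // Keep the interval inside the configured bounds.
        return Math.max(MIN_SECONDS, Math.min(MAX_SECONDS, next));
    }
}
```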
