nutch-dev mailing list archives

From "Armel Nene (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting unmodified content
Date Thu, 18 Jan 2007 10:00:30 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465700 ]

Armel Nene commented on NUTCH-61:
---------------------------------

I have attached a new patch, as the old one needed updating before it could be used with Nutch
0.8.1. It would be great if more people could test the feature, as I have encountered some issues
with plugins such as parse-xml when they are used with this patch. Over the http protocol the patch
works well when indexing text/xml/html. When it is used with a plugin such as parse-xml, the fetcher
throws a Java IllegalStateException. If anybody has seen this error and knows how to fix it, please
share it with the rest of us. For now, I'm working on fixing this issue and hopefully adapting the
feature to the 0.9.0 version.

> Adaptive re-fetch interval. Detecting unmodified content
> --------------------------------------------------------
>
>                 Key: NUTCH-61
>                 URL: https://issues.apache.org/jira/browse/NUTCH-61
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>         Attachments: 20050606.diff, 20051230.txt, 20060227.txt, nutch-61-417287.patch,
nutch-61-492176.patch
>
>
> Currently Nutch does not automatically adjust its re-fetch period, regardless of whether individual
pages change seldom or frequently. The goal of these changes is to extend the current codebase
to support various possible adjustments to re-fetch times and intervals, and specifically
a re-fetch schedule that tries to adapt the period between consecutive fetches to the period
of content changes.
> Also, these patches implement a check for whether the content has changed since the last fetch;
protocol plugins are also changed to make use of this information, so that if content is unmodified
it doesn't have to be fetched and processed.
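
To make the idea in the description above concrete, here is a minimal sketch in Java of how
an adaptive schedule might work. It is not taken from the attached patches: the class name,
method names, rates and bounds are all hypothetical, and an MD5 digest of the fetched bytes
is only assumed as a stand-in for whatever change check the patches actually use. The sketch
lengthens the re-fetch interval when a page comes back unmodified and shortens it when the
page has changed, clamped to a minimum and maximum.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical sketch; none of these names come from the NUTCH-61 patches. */
class AdaptiveRefetchSketch {
    // Assumed tuning values, chosen only for illustration.
    static final double INC_RATE = 0.4;                    // grow interval by 40% when unmodified
    static final double DEC_RATE = 0.2;                    // shrink interval by 20% when modified
    static final long MIN_INTERVAL_SEC = 60L;              // lower bound: one minute
    static final long MAX_INTERVAL_SEC = 365L * 24 * 3600; // upper bound: one year

    /** Digest of the fetched bytes, used here as a simple content signature. */
    static byte[] signature(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** A page counts as unmodified only if a previous signature exists and matches. */
    static boolean isUnmodified(byte[] oldSig, byte[] newSig) {
        return oldSig != null && Arrays.equals(oldSig, newSig);
    }

    /** Lengthen the interval for unmodified pages, shorten it for modified ones. */
    static long nextInterval(long currentSec, boolean unmodified) {
        double factor = unmodified ? 1.0 + INC_RATE : 1.0 - DEC_RATE;
        long next = Math.round(currentSec * factor);
        return Math.max(MIN_INTERVAL_SEC, Math.min(MAX_INTERVAL_SEC, next));
    }

    public static void main(String[] args) {
        byte[] previous = signature("<html>v1</html>".getBytes(StandardCharsets.UTF_8));
        byte[] current = signature("<html>v1</html>".getBytes(StandardCharsets.UTF_8));
        long interval = 30L * 24 * 3600; // start from an assumed 30-day interval
        boolean unmodified = isUnmodified(previous, current);
        System.out.println("unmodified=" + unmodified
                + " nextIntervalSec=" + nextInterval(interval, unmodified));
    }
}
```

With these assumed rates the interval grows faster than it shrinks, so stable pages drift
toward the upper bound while frequently changing pages stay closer to the lower one. A
protocol-level check (for example an HTTP conditional request) could feed the same
"unmodified" flag without downloading the content at all.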

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
