nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1430) Freegenerator records overwrite CrawlDB records with AdaptiveFetchSchedule
Date Tue, 17 Jul 2012 12:36:34 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1430:
---------------------------------

    Attachment: NUTCH-1430-1.6-1.patch

Patch for 1.6. It fixes the issue by assigning a default interval to any CrawlDatum record
that lacks one, before the rest of the scheduler's code runs.
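
To illustrate the idea, here is a minimal standalone sketch. It is not the patch itself and
does not use Nutch classes. It assumes the scheduler scales the current interval and then
clamps it to a minimum, so a record that arrives with a zero interval collapses to that
minimum. The class name, method name, rates and maximum are hypothetical; only the 30-day
default and the 60-second value are taken from the readdb output quoted below.

```java
public class AdaptiveIntervalSketch {
    // 30 days, as shown in the readdb output without AdaptiveFetchSchedule.
    static final float DEFAULT_INTERVAL = 2592000f;
    // 60 seconds, as shown in the readdb output with AdaptiveFetchSchedule.
    // Treating it as the scheduler's lower bound is an assumption.
    static final float MIN_INTERVAL = 60f;
    // Assumed values, for illustration only.
    static final float MAX_INTERVAL = 31536000f;
    static final float INC_RATE = 0.2f;
    static final float DEC_RATE = 0.2f;

    /**
     * Computes the next interval in seconds.
     * When applyDefault is true, a zero interval is first replaced by the
     * default, which is the behaviour the patch description outlines.
     */
    static float nextInterval(float interval, boolean modified, boolean applyDefault) {
        if (applyDefault && interval == 0f) {
            interval = DEFAULT_INTERVAL;
        }
        // Shrink the interval for modified pages, grow it for unmodified ones.
        interval *= modified ? (1f - DEC_RATE) : (1f + INC_RATE);
        // Clamp to the allowed range; zero times any rate stays zero and ends up here.
        return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, interval));
    }

    public static void main(String[] args) {
        System.out.println("without default: " + nextInterval(0f, true, false));
        System.out.println("with default:    " + nextInterval(0f, true, true));
    }
}
```

In this sketch, the call without the default returns the minimum interval, while the call
with the default returns a value derived from the 30-day interval.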


                
> Freegenerator records overwrite CrawlDB records with AdaptiveFetchSchedule
> --------------------------------------------------------------------------
>
>                 Key: NUTCH-1430
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1430
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.5
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Critical
>             Fix For: 1.6
>
>         Attachments: NUTCH-1430-1.6-1.patch
>
>
> Steps to reproduce:
> Without AdaptiveFetchSchedule:
> {code}
> $ bin/nutch readdb crawl/crawldb/ -url http://www.openindex.io/en/home.html
> URL: http://www.openindex.io/en/home.html
> Version: 7
> Status: 2 (db_fetched)
> Fetch time: Thu Aug 16 13:58:23 CEST 2012
> Modified time: Thu Jan 01 01:00:00 CET 1970
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 0.0
> Signature: c2601ca503f2fc5edcb286501d7fb271
> Metadata: Content-Type: text/html_pst_: success(1), lastModified=0
> {code}
> With AdaptiveFetchSchedule:
> {code}
> $ bin/nutch readdb crawl/crawldb/ -url http://www.openindex.io/en/home.html
> URL: http://www.openindex.io/en/home.html
> Version: 7
> Status: 2 (db_fetched)
> Fetch time: Tue Jul 17 13:56:33 CEST 2012
> Modified time: Tue Jul 17 13:55:33 CEST 2012
> Retries since fetch: 0
> Retry interval: 60 seconds (0 days)
> Score: 0.0
> Signature: 23567bb52ee8b905b8649c4305ed82ee
> Metadata: Content-Type: text/html_pst_: success(1), lastModified=0
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
