nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1422) reset signature for redirects
Date Fri, 06 Jul 2012 14:09:34 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1422:
-----------------------------------

    Attachment: NUTCH-1422_redir_notmodified_log.txt
    
> reset signature for redirects
> -----------------------------
>
>                 Key: NUTCH-1422
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1422
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb, fetcher
>    Affects Versions: 1.4
>            Reporter: Sebastian Nagel
>             Fix For: 1.6
>
>         Attachments: NUTCH-1422_redir_notmodified_log.txt
>
>
> In a long-running continuous crawl with Nutch 1.4, URLs that return an HTTP redirect (http.redirect.max = 0) are kept as not-modified in the CrawlDb. Brief log (cf. the attached dump of segment / CrawlDb data):
>  2012-02-23 :  injected
>  2012-02-24 :  fetched
>  2012-03-30 :  re-fetched, signature changed
>  2012-04-20 :  re-fetched, redirected
>  2012-04-24 :  in CrawlDb as db_notmodified, still indexed with old content!
> The signature of a previously fetched document is not reset when the URL/doc later changes to a redirect. CrawlDbReducer.reduce then sets the status to db_notmodified because the signature carried with the fetch status is identical to the old one.
> Possible fixes (??):
> * reset the signature in Fetcher
> * handle this case in CrawlDbReducer.reduce
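
The two possible fixes can be illustrated with a toy model. This is only a sketch under stated assumptions: the class, enum, and method names below (RedirectSignatureSketch, Datum, fetchRedirect, reduce) are hypothetical and are not Nutch's CrawlDatum, Fetcher, or CrawlDbReducer code. It assumes, as the description suggests, that a redirect result carries the stale signature along and that the update step compares signatures before looking at the fetch status.

```java
import java.util.Arrays;

// Toy model only: hypothetical types, NOT Nutch's CrawlDatum or CrawlDbReducer.
final class RedirectSignatureSketch {

    enum Status { FETCH_SUCCESS, FETCH_REDIRECT, DB_FETCHED, DB_NOTMODIFIED, DB_REDIRECT }

    static final class Datum {
        final Status status;
        final byte[] signature; // null means "no signature"

        Datum(Status status, byte[] signature) {
            this.status = status;
            this.signature = signature;
        }
    }

    // Fix option 1 (fetcher side): with resetSignature = true a redirect result
    // carries no signature. Assumption: without the reset, the stale signature
    // of the CrawlDb entry travels along with the fetch status.
    static Datum fetchRedirect(Datum dbEntry, boolean resetSignature) {
        return new Datum(Status.FETCH_REDIRECT, resetSignature ? null : dbEntry.signature);
    }

    // Simplified update step. With handleRedirect = false it compares signatures
    // first, which reproduces the reported symptom. With handleRedirect = true it
    // checks the fetch status before any signature comparison (fix option 2).
    static Status reduce(Datum dbEntry, Datum fetched, boolean handleRedirect) {
        if (handleRedirect && fetched.status == Status.FETCH_REDIRECT) {
            return Status.DB_REDIRECT;
        }
        if (dbEntry.signature != null && Arrays.equals(dbEntry.signature, fetched.signature)) {
            return Status.DB_NOTMODIFIED;
        }
        return fetched.status == Status.FETCH_REDIRECT ? Status.DB_REDIRECT : Status.DB_FETCHED;
    }

    public static void main(String[] args) {
        Datum old = new Datum(Status.DB_FETCHED, new byte[] {1, 2, 3});
        System.out.println(reduce(old, fetchRedirect(old, false), false)); // symptom
        System.out.println(reduce(old, fetchRedirect(old, true), false));  // fix option 1
        System.out.println(reduce(old, fetchRedirect(old, false), true));  // fix option 2
    }
}
```

In this model either fix alone is enough to keep the redirected URL from being recorded as not-modified; which of the two is preferable in Nutch itself is the open question of the issue.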

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
