nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2164) Inconsistent 'Modified Time' in crawl db
Date Wed, 11 May 2016 13:04:12 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280072#comment-15280072 ]

ASF GitHub Bot commented on NUTCH-2164:
---------------------------------------

GitHub user sebastian-nagel opened a pull request:

    https://github.com/apache/nutch/pull/108

    NUTCH-2164 NUTCH-2242 Inconsistent 'Modified Time' in crawl db / lastModified not always set
    
     - set modified time (time of last successful fetch) by DefaultFetchSchedule and AdaptiveFetchSchedule
       but only if the document is actually modified
     - update unit tests to check whether modification time is properly set
     - set modified time (sent by responding server in HTTP header) in ProtocolOutput:
       FetchSchedule implementations can access the HTTP modified time from CrawlDatum's
       metadata (PROTO_STATUS_KEY = "_pst_")
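The rule described in the list above can be sketched roughly as follows. This is only an illustration and not the code from the pull request: PageRecord, ModifiedTimeSketch, onFetch and the metadata key "httpLastModified" are hypothetical names standing in for CrawlDatum, the FetchSchedule implementations and the real metadata key. The sketch assumes the caller already knows whether the content changed.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a crawl db entry; this is NOT Nutch's CrawlDatum. */
class PageRecord {
    long fetchTime;     // time of the last fetch, epoch millis
    long modifiedTime;  // 0 means "never set"
    final Map<String, Long> metadata = new HashMap<>();
}

class ModifiedTimeSketch {
    /** Hypothetical metadata key for the server-reported Last-Modified value. */
    static final String HTTP_MODIFIED_KEY = "httpLastModified";

    /**
     * Records a successful fetch.
     * The modified time is moved to the fetch time only when the content
     * actually changed; the HTTP Last-Modified value (if any) is kept
     * separately in the metadata so a schedule can consult it later.
     */
    static void onFetch(PageRecord rec, long fetchTime,
                        boolean contentChanged, Long httpLastModified) {
        rec.fetchTime = fetchTime;
        if (httpLastModified != null) {
            rec.metadata.put(HTTP_MODIFIED_KEY, httpLastModified);
        }
        if (contentChanged) {
            rec.modifiedTime = fetchTime;
        }
    }
}
```

With this shape, an unmodified re-fetch advances the fetch time but leaves the modified time at the last fetch that saw a change.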

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sebastian-nagel/nutch NUTCH-2164

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/108.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #108
    
----
commit b0c2969e47a3129a0abd0f98b736616ebaf5b540
Author: Sebastian Nagel <snagel@apache.org>
Date:   2016-03-11T21:55:24Z

    NUTCH-2164 NUTCH-2242 Inconsistent 'Modified Time' in crawl db / lastModified not always set
     - set modified time (time of last successful fetch) by DefaultFetchSchedule and AdaptiveFetchSchedule
       but only if the document is actually modified
     - update unit tests to check whether modification time is properly set
     - set modified time (sent by responding server in HTTP header) in ProtocolOutput:
       FetchSchedule implementations can access the HTTP modified time from CrawlDatum's
       metadata (PROTO_STATUS_KEY = "_pst_")

----


> Inconsistent 'Modified Time' in crawl db
> ----------------------------------------
>
>                 Key: NUTCH-2164
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2164
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb, fetcher
>    Affects Versions: 1.11
>            Reporter: Thamme Gowda N
>            Priority: Minor
>
> The 'Modified time' in crawldb is invalid. It is set to (0 - timezone difference).
> *How to verify/reproduce:*
>   Run 'nutch readdb /path/to/crawldb -dump yy' and then inspect the content of 'yy'.
> The following improvements can be made:
> 1. Set modified time by DefaultFetchSchedule
> 2. Set ProtocolStatus.lastModified if modified time is available in protocol response headers
> This issue is also discussed on the dev mailing list: http://www.mail-archive.com/dev@nutch.apache.org/msg19803.html#
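
One possible reading of "(0 - timezone difference)" in the report is that the stored value is 0 (the Unix epoch) and the dump renders it in the local timezone. That is an assumption, not something the report states. Under that assumption, the small sketch below shows how an unset timestamp would appear in different zones. EpochZeroSketch and render are hypothetical names and not part of Nutch.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

class EpochZeroSketch {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss");

    /** Formats an epoch-millis timestamp as local wall-clock time in the given zone. */
    static String render(long epochMillis, String zoneId) {
        return ZonedDateTime
                .ofInstant(Instant.ofEpochMilli(epochMillis), ZoneId.of(zoneId))
                .format(FMT);
    }

    public static void main(String[] args) {
        // An unset modified time (0) shown in three zones.
        System.out.println(render(0L, "UTC"));                 // 1970-01-01 00:00:00
        System.out.println(render(0L, "America/Los_Angeles")); // 1969-12-31 16:00:00
        System.out.println(render(0L, "Europe/Berlin"));       // 1970-01-01 01:00:00
    }
}
```

If the dumped value matches one of these shifted epoch times for the local zone, the modified time was most likely never set.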



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
