nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field
Date Thu, 11 Oct 2012 19:43:02 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474460#comment-13474460 ]

Sebastian Nagel commented on NUTCH-1475:
----------------------------------------

Indeed, a modified time in the future is a bad choice.
But CrawlDatum and WebPage both have a modifiedTime field. It should contain the time of the
last fetch or, ideally, the time of an earlier fetch if the document has not been modified since.
                
> Nutch 2.1 Index-More Plugin -- A better fall back value for date field
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-1475
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1475
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 2.1, 1.5.1
>         Environment: All
>            Reporter: James Sullivan
>            Priority: Minor
>              Labels: index-more, plugins
>             Fix For: 1.6, 2.2
>
>         Attachments: index-more-1xand2x.patch, index-more-2x.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Among other fields, the index-more plugin for Nutch 2.x provides a "last modified" field and a
> "date" field for the Solr index. The "last modified" field holds the last-modified date from
> the HTTP headers if available; otherwise it is left empty. Currently, the "date" field is the
> same as the "last modified" field unless that field is empty, in which case getFetchTime is
> used as a fallback. I think getFetchTime is not a good fallback: it is the next fetch time,
> often a month or more in the future, which doesn't make sense for the date field. Users do
> not expect webpages or documents with future dates. A more sensible fallback would be the
> current date at the time the page is indexed.
> This is possible by simply changing line 97 of
> https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
> from
>     time = page.getFetchTime(); // use fetch time
> to
>     time = new Date().getTime();
> Users interested in the getFetchTime value can still get it from the "tstamp" field.
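
The fallback order discussed in this thread (HTTP last-modified date first, then the stored modifiedTime, then the time of indexing) could be sketched roughly as below. This is a standalone illustration, not code from MoreIndexingFilter: the class DateFallback, the method chooseDate, and its parameters are hypothetical names, and treating a non-positive or future value as "unusable" is an assumption made here for the example.

```java
// Hypothetical sketch; none of these names exist in Nutch.
class DateFallback {

    /**
     * Picks a value for the "date" field, in milliseconds since the epoch.
     *
     * Assumption for this sketch: a value <= 0 means "not available",
     * and a value later than nowMillis is treated as unusable.
     *
     * @param lastModifiedHeaderMillis date parsed from the HTTP headers, or <= 0 if absent
     * @param modifiedTimeMillis       stored modified time of the page, or <= 0 if unset
     * @param nowMillis                time of indexing
     */
    static long chooseDate(long lastModifiedHeaderMillis,
                           long modifiedTimeMillis,
                           long nowMillis) {
        if (isUsable(lastModifiedHeaderMillis, nowMillis)) {
            return lastModifiedHeaderMillis; // preferred: date from the HTTP headers
        }
        if (isUsable(modifiedTimeMillis, nowMillis)) {
            return modifiedTimeMillis;       // next: stored modified time
        }
        return nowMillis;                    // last resort: time of indexing
    }

    // A timestamp is usable if it is set and does not lie in the future.
    private static boolean isUsable(long millis, long nowMillis) {
        return millis > 0 && millis <= nowMillis;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long thirtyDays = 30L * 24 * 60 * 60 * 1000;

        // No header date and a "modified time" a month ahead: falls back to now.
        long chosen = chooseDate(0L, now + thirtyDays, now);
        System.out.println(chosen == now);
    }
}
```

With this ordering the "date" field can never be later than the time of indexing, which is the property the issue asks for. Whether a stored modified time is actually populated in a given Nutch version is not something this sketch assumes.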

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
