nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
Date Thu, 30 Jan 2014 11:04:10 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886489#comment-13886489 ]

Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

SitemapReducer overwrites the score, modified time, and fetch interval of existing CrawlDb
entries with the values from the sitemap. Is this the desired behavior? What about forgotten,
hopelessly outdated sitemaps? Or bogus values (last mod in the future)?
If a sitemap does not specify one of score, modified time, or fetch interval, that value is
set to zero. In this case, we should definitely not overwrite existing values. Newly added
entries should get assigned db.fetch.interval.default and a reasonable score, e.g. 0.5 as
recommended by [2] (http://www.sitemaps.org/protocol.html#xmlTagDefinitions). But that may
depend on scoring plugins. Comments?
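
To make the proposal concrete, here is a minimal sketch of one possible merge policy. It is
not the SitemapReducer from the attached patches: the class SitemapMergeSketch, the simplified
Entry type, and the merge method are hypothetical names, and the default interval is only an
assumed stand-in for db.fetch.interval.default. Zero is treated as "not specified", and a
last-modified time in the future is ignored.

```java
// Hypothetical sketch; not the SitemapReducer from the NUTCH-1465 patches.
final class SitemapMergeSketch {

    // Assumed defaults: 0.5 follows the sitemaps.org recommendation; the interval
    // stands in for db.fetch.interval.default (the value used here is an assumption).
    static final float DEFAULT_SCORE = 0.5f;
    static final int DEFAULT_FETCH_INTERVAL_SEC = 30 * 24 * 60 * 60;

    /** Simplified stand-in for a CrawlDb entry; zero means "not specified". */
    static final class Entry {
        final float score;
        final long modifiedTime;  // epoch milliseconds
        final int fetchInterval;  // seconds

        Entry(float score, long modifiedTime, int fetchInterval) {
            this.score = score;
            this.modifiedTime = modifiedTime;
            this.fetchInterval = fetchInterval;
        }
    }

    /**
     * Merges sitemap values into an existing entry (null if the URL is new).
     * Unspecified or implausible sitemap values never replace existing ones.
     */
    static Entry merge(Entry existing, Entry sitemap, long now) {
        // Ignore a last-modified time that is unset (zero) or lies in the future.
        boolean modifiedUsable = sitemap.modifiedTime > 0 && sitemap.modifiedTime <= now;

        if (existing == null) {
            // New entry: fall back to defaults for anything the sitemap omits.
            return new Entry(
                sitemap.score > 0 ? sitemap.score : DEFAULT_SCORE,
                modifiedUsable ? sitemap.modifiedTime : 0L,
                sitemap.fetchInterval > 0 ? sitemap.fetchInterval : DEFAULT_FETCH_INTERVAL_SEC);
        }
        // Existing entry: keep the stored value wherever the sitemap gives nothing usable.
        return new Entry(
            sitemap.score > 0 ? sitemap.score : existing.score,
            modifiedUsable ? sitemap.modifiedTime : existing.modifiedTime,
            sitemap.fetchInterval > 0 ? sitemap.fetchInterval : existing.fetchInterval);
    }
}
```

This sketch does not address forgotten or outdated sitemaps, and whether a sitemap score
should replace an existing score at all may still depend on the scoring plugins in use.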

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.8
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)