nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
Date Fri, 31 Jan 2014 09:32:09 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887588#comment-13887588
] 

Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

"filters and normalizers": -noFilter is not really an option if sitemaps are used and gzipped
documents (e.g. software packages) are to be excluded. In customized crawls URL filter rules
are often complex, and I want to avoid having two sets of rules in the end. Sitemaps are different
from normal docs/URLs (robots.txt is also different): they are not stored in CrawlDb and may
require other filter rules. What about an option "-noFilterSitemap"?

"Fetch intervals of 1 second or 1 hour may cause troubles":
> We are blindly accepting user's custom information in inject.
Yes, because the user (crawl administrator) can change the seed list (it's a file/directory
on local disk or HDFS). Sitemaps are not necessarily under control of the user. If we (optionally)
adjust the fetch interval by (configurable) min/max limits, that would help to avoid unreasonable
values which would, e.g., cause a bunch of pages to be re-fetched every cycle.
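
A minimal sketch of the min/max adjustment suggested above, in plain Java. The class name, the
method name and the limit values are hypothetical and not taken from any of the attached patches
or from existing Nutch code; the limits are assumed to come from configuration.

```java
// Hypothetical helper, not part of Nutch: clamps a fetch interval taken from a
// sitemap into a configured [minSec, maxSec] range.
final class SitemapIntervalLimits {
    private SitemapIntervalLimits() {}

    /** Returns the sitemap interval (in seconds) forced into [minSec, maxSec]. */
    static int clamp(int sitemapIntervalSec, int minSec, int maxSec) {
        if (minSec > maxSec) {
            throw new IllegalArgumentException("minSec must not exceed maxSec");
        }
        return Math.max(minSec, Math.min(maxSec, sitemapIntervalSec));
    }

    public static void main(String[] args) {
        int minSec = 86400;        // assumed lower limit: 1 day
        int maxSec = 30 * 86400;   // assumed upper limit: 30 days
        // An interval of 1 second from a sitemap is raised to the lower limit.
        System.out.println(clamp(1, minSec, maxSec));
        // An interval of 1 year is lowered to the upper limit.
        System.out.println(clamp(365 * 86400, minSec, maxSec));
    }
}
```

With limits like these, a sitemap claiming a 1 second or 1 hour interval could no longer force a
re-fetch of its pages in every cycle.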

"SitemapReducer overwriting" :
In a continuous crawl we know when pages are modified and have heuristics to estimate the
change frequency of a page (AdaptiveFetchSchedule). The question is whether we trust those
values, which are obtained from crawling, or prefer (possibly bogus) values from sitemaps.
Using the sitemap values for new URLs found in sitemaps is less critical.
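
To make the choice concrete, here is a rough sketch of one possible policy, in plain Java. All
names are hypothetical; this is not the SitemapReducer from the patches. It assumes a null
interval means the URL is not yet in CrawlDb, and that a (hypothetical) boolean switch decides
whether sitemap values may overwrite values obtained from crawling.

```java
// Hypothetical policy sketch, not the actual SitemapReducer logic.
final class SitemapIntervalPolicy {
    private SitemapIntervalPolicy() {}

    /**
     * Picks the fetch interval (in seconds) to store for a URL.
     *
     * @param crawlDbIntervalSec interval already known from crawling, or null for a new URL
     * @param sitemapIntervalSec interval derived from the sitemap
     * @param preferSitemap      assumed switch: may the sitemap overwrite crawl-derived values?
     */
    static int chooseInterval(Integer crawlDbIntervalSec, int sitemapIntervalSec,
                              boolean preferSitemap) {
        if (crawlDbIntervalSec == null) {
            // New URL found in a sitemap: nothing better is known, use the sitemap value.
            return sitemapIntervalSec;
        }
        // Known URL: keep the crawl-derived value unless overwriting is explicitly enabled.
        return preferSitemap ? sitemapIntervalSec : crawlDbIntervalSec;
    }

    public static void main(String[] args) {
        System.out.println(chooseInterval(null, 3600, false));    // new URL
        System.out.println(chooseInterval(604800, 3600, false));  // known URL, keep crawl value
        System.out.println(chooseInterval(604800, 3600, true));   // known URL, sitemap wins
    }
}
```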

> (a) score : Crawler commons assigns a default score of 0.5 if there was none provided
in sitemap.
Needs an upgrade of crawler-commons (0.2 is still used, which sets priority to 0).

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.8
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch,
NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and
appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
