nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
Date Thu, 30 Jan 2014 09:56:10 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886453#comment-13886453
] 

Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

Thanks, [~tejasp], for the improvements! Testing continued...

Sitemaps are treated the same as ordinary URLs/docs, but there are some differences. Shouldn't
we relax the default limits and filters and trust the restrictions specified in the sitemap protocol?
* URL filters and normalizers: you may want to exclude .gz docs via the suffix filter but still
fetch gzipped sitemaps. That's currently not possible. Is it really necessary to normalize/filter sitemap
URLs? If yes, this should be optional.
* the default content limits {http,ftp,file}.content.limit (64 kB) are quite small even for mid-size
sitemaps. OK, you could set them via {{-D...}}, but why not increase them to SiteMapParser.MAX_BYTES_ALLOWED?
* maybe we also want to increase the fetch timeout
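
A minimal sketch of the content-limit idea, not code from the patch: when the document being fetched is a sitemap, use the larger of the configured limit and a sitemap-specific maximum. The class name, method name and SITEMAP_MAX_BYTES value are hypothetical placeholders (the real value would come from SiteMapParser.MAX_BYTES_ALLOWED), and the sketch assumes that a negative limit means "no truncation", as documented for http.content.limit.

```java
class SitemapContentLimit {
    // Placeholder only; stands in for SiteMapParser.MAX_BYTES_ALLOWED.
    static final int SITEMAP_MAX_BYTES = 10 * 1024 * 1024;

    /**
     * Returns the content limit to apply to one fetch.
     * configuredLimit is the value of e.g. http.content.limit in bytes.
     */
    static int effectiveLimit(int configuredLimit, boolean isSitemap) {
        if (configuredLimit < 0) {
            return configuredLimit; // assumed: negative means no truncation at all
        }
        if (!isSitemap) {
            return configuredLimit; // ordinary documents keep the configured limit
        }
        // Sitemaps never get less than the sitemap-specific maximum.
        return Math.max(configuredLimit, SITEMAP_MAX_BYTES);
    }
}
```

With the 64 kB default, effectiveLimit(65536, true) yields the sitemap maximum, while ordinary documents are still truncated at 64 kB.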

Processing sitemap indexes fails:
* the check sitemap.isIndex() skips all referenced sitemaps
* the protocol for the sitemap index and the referenced sub-sitemaps may be different (e.g., one sub-sitemap
could be https while others are http)
* if processing one of the referenced sitemaps fails, the remaining sub-sitemaps are not processed
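
The behaviour I would expect could be sketched roughly as follows. This is only an illustration, not the patch's code: SitemapIndexWalker, Result and the fetchAndParse callback are hypothetical names, and the callback stands in for "fetch one sub-sitemap with the protocol matching its scheme and return the URLs it lists". The scheme is resolved per sub-sitemap, and a failure in one sub-sitemap is recorded without aborting the rest.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

class SitemapIndexWalker {
    /** URLs collected from all sub-sitemaps, plus the sub-sitemaps that failed. */
    static final class Result {
        final List<String> urls = new ArrayList<>();
        final List<String> failed = new ArrayList<>();
    }

    /**
     * Processes every sub-sitemap referenced by an index.
     * fetchAndParse(scheme, url) is a hypothetical callback.
     */
    static Result processIndex(List<String> subSitemaps,
                               BiFunction<String, String, List<String>> fetchAndParse) {
        Result result = new Result();
        for (String sitemapUrl : subSitemaps) {
            try {
                // Pick the protocol per sub-sitemap, not once for the whole index.
                String scheme = URI.create(sitemapUrl).getScheme();
                result.urls.addAll(fetchAndParse.apply(scheme, sitemapUrl));
            } catch (RuntimeException e) {
                // Remember the failure and continue with the remaining sub-sitemaps.
                result.failed.add(sitemapUrl);
            }
        }
        return result;
    }
}
```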

Fetch intervals are taken unchecked from <changefreq>. Should we limit them to reasonable
values (db.fetch.schedule.adaptive.min_interval <= interval <= db.fetch.interval.max)?
Fetch intervals of 1 second or 1 hour may cause trouble. [[1|http://www.sitemaps.org/protocol.html#xmlTagDefinitions]]
explicitly says that <changefreq> "is considered a hint and not a command".
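
A minimal sketch of such a check, assuming intervals are handled in seconds; the class and method names are hypothetical, and the bounds would be read from db.fetch.schedule.adaptive.min_interval and db.fetch.interval.max rather than hard-coded:

```java
class FetchIntervalClamp {
    /**
     * Limits an interval derived from <changefreq> to the range [minSeconds, maxSeconds].
     * The requested value is treated as a hint: it is adjusted, never rejected.
     */
    static int clampInterval(int requestedSeconds, int minSeconds, int maxSeconds) {
        if (minSeconds > maxSeconds) {
            throw new IllegalArgumentException("min interval exceeds max interval");
        }
        return Math.max(minSeconds, Math.min(requestedSeconds, maxSeconds));
    }
}
```

For example, with illustrative bounds of 60 seconds and 7776000 seconds, a requested interval of 1 second becomes 60, and anything above the upper bound is cut down to 7776000.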


> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.8
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch,
NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and
appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
