nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
Date Thu, 30 Jan 2014 09:56:10 GMT


Sebastian Nagel commented on NUTCH-1465:

Thanks, [~tejasp], for the improvements! Testing continued...

Sitemaps are treated the same as ordinary URLs/docs, but there are some differences. Shouldn't
we relax the default limits and filters and trust the restrictions specified in the sitemap protocol?
* URL filters and normalizers: maybe you want to exclude .gz docs per suffix filter but still
fetch gzipped sitemaps. That's not possible. Is it really necessary to normalize/filter sitemap
URLs? If yes, this should be optional.
* default content limits {http,ftp,file}.content.limit (64 kB) are quite small even for mid-size
sitemaps. Ok, you could set it per {{-D...}} but why not increase it to SiteMapParser.MAX_BYTES_ALLOWED?
* maybe we also want to increase the fetch timeout
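
A rough sketch of the content-limit idea from the list above: keep the configured {protocol}.content.limit for ordinary docs, but raise it to the parser's maximum for sitemaps. SitemapLimits, effectiveLimit and the plain Map standing in for the configuration are hypothetical names for illustration, not Nutch classes. The parser maximum is passed in as a parameter because its actual value (SiteMapParser.MAX_BYTES_ALLOWED) is not given here. The sketch also assumes a negative limit means "no truncation".

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: SitemapLimits is a hypothetical helper, not a Nutch class.
class SitemapLimits {
    // Default of the {http,ftp,file}.content.limit properties (64 kB).
    static final int DEFAULT_CONTENT_LIMIT = 64 * 1024;

    /**
     * Returns the content limit to apply to one fetch.
     * conf stands in for the crawler configuration; sitemapMaxBytes stands in
     * for the parser's own upper bound (an assumed input, not a known value).
     */
    static int effectiveLimit(Map<String, String> conf, String protocol,
                              boolean isSitemap, int sitemapMaxBytes) {
        String raw = conf.get(protocol + ".content.limit");
        int configured = (raw == null) ? DEFAULT_CONTENT_LIMIT : Integer.parseInt(raw.trim());
        // Assumption: a negative limit means "do not truncate", so leave it alone.
        if (!isSitemap || configured < 0) {
            return configured;
        }
        // For sitemaps, never go below what the parser is able to accept.
        return Math.max(configured, sitemapMaxBytes);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        int assumedParserMax = 10 * 1024 * 1024; // placeholder value for the demo
        System.out.println(effectiveLimit(conf, "http", false, assumedParserMax));
        System.out.println(effectiveLimit(conf, "http", true, assumedParserMax));
    }
}
```

With something like this, a user-supplied limit that is already larger than the parser maximum still wins, so {{-D...}} overrides keep working.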

Processing sitemap indexes fails:
* the check sitemap.isIndex() skips all referenced sitemaps
* protocol for sitemap index and referenced sub-sitemaps may be different (e.g., one sub-sitemap
could be https while others are http)
* if processing one of the referenced sitemaps fails, the remaining sub-sitemaps are not processed
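
To illustrate what the last two points ask for, here is a minimal sketch, not the patch's code: each referenced sub-sitemap is loaded with a loader chosen by its own URL scheme, and a failure in one of them is recorded without stopping the loop. SitemapIndexWalker, SitemapLoader and Result are hypothetical names; the loader hides the real fetch-and-parse step.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Sketch only: all types here are hypothetical, not Nutch or crawler-commons classes.
class SitemapIndexWalker {

    /** Fetches and parses one sitemap and returns the URLs it lists; may throw. */
    interface SitemapLoader {
        List<String> load(String sitemapUrl) throws Exception;
    }

    /** Collected URLs plus the sub-sitemaps that could not be processed. */
    static class Result {
        final List<String> urls = new ArrayList<>();
        final List<String> failed = new ArrayList<>();
    }

    static Result processIndex(List<String> subSitemaps,
                               Map<String, SitemapLoader> loadersByScheme) {
        Result result = new Result();
        for (String sitemapUrl : subSitemaps) {
            try {
                // Pick the protocol per sub-sitemap instead of reusing the index's one.
                String scheme = URI.create(sitemapUrl).getScheme();
                if (scheme == null) {
                    throw new IllegalArgumentException("no scheme: " + sitemapUrl);
                }
                SitemapLoader loader = loadersByScheme.get(scheme.toLowerCase(Locale.ROOT));
                if (loader == null) {
                    throw new IllegalArgumentException("no loader for scheme: " + scheme);
                }
                result.urls.addAll(loader.load(sitemapUrl));
            } catch (Exception e) {
                // One broken sub-sitemap must not prevent processing of the rest.
                result.failed.add(sitemapUrl);
            }
        }
        return result;
    }
}
```

Keeping the failed entries in the result lets the caller log or retry them later instead of losing the whole index.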

Fetch intervals are taken unchecked from <changefreq>. Should we limit them to reasonable
values (db.fetch.schedule.adaptive.min_interval <= interval <= db.fetch.interval.max)?
Fetch intervals of 1 second or 1 hour may cause trouble. [[1|]]
explicitly says that <changefreq> "is considered a hint and not a command".
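
The proposed bounds check could look roughly like the following sketch. FetchIntervalClamp and its methods are hypothetical names. The mapping from <changefreq> values to seconds is an assumption for illustration (a month is taken as 30 days, a year as 365, and "never" or unknown values fall back to the maximum). The bounds are passed in by the caller and would come from db.fetch.schedule.adaptive.min_interval and db.fetch.interval.max.

```java
// Sketch only: FetchIntervalClamp is a hypothetical helper, not a Nutch class.
class FetchIntervalClamp {

    /** Translates a <changefreq> value into seconds (assumed mapping). */
    static int toSeconds(String changeFreq, int fallbackSeconds) {
        switch (changeFreq.trim().toLowerCase(java.util.Locale.ROOT)) {
            case "always":  return 1;
            case "hourly":  return 60 * 60;
            case "daily":   return 24 * 60 * 60;
            case "weekly":  return 7 * 24 * 60 * 60;
            case "monthly": return 30 * 24 * 60 * 60;
            case "yearly":  return 365 * 24 * 60 * 60;
            default:        return fallbackSeconds; // "never" or unknown value
        }
    }

    /** Forces minSeconds <= interval <= maxSeconds. */
    static int clamp(int requestedSeconds, int minSeconds, int maxSeconds) {
        if (minSeconds > maxSeconds) {
            throw new IllegalArgumentException("min interval exceeds max interval");
        }
        return Math.max(minSeconds, Math.min(maxSeconds, requestedSeconds));
    }

    /** Treats <changefreq> as a hint: convert it, then keep it within the bounds. */
    static int clampedInterval(String changeFreq, int minSeconds, int maxSeconds) {
        return clamp(toSeconds(changeFreq, maxSeconds), minSeconds, maxSeconds);
    }

    public static void main(String[] args) {
        int min = 60;        // illustrative lower bound in seconds
        int max = 7776000;   // illustrative upper bound in seconds (90 days)
        System.out.println(clampedInterval("always", min, max));
        System.out.println(clampedInterval("yearly", min, max));
    }
}
```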

> Support sitemaps in Nutch
> -------------------------
>                 Key: NUTCH-1465
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.8
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch,
NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and
appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0]
> [1]

