nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
Date Thu, 23 Jan 2014 14:37:39 GMT


Lewis John McGibbney commented on NUTCH-1465:

Hey [~tejasp]. Again, great work! Some minor comments:

* The class-level Javadoc in SitemapProcessor would be more legible if it used a format
similar to:
 * <p>Performs Sitemap processing by fetching sitemap links, parsing the content and
 * merging the urls from Sitemap (with the metadata) with the existing crawldb.</p>
 * <p>There are two use cases supported in Nutch's Sitemap processing:</p>
 * <ol>
 *  <li>Sitemaps are considered as "remote seed lists". Crawl administrators can prepare
 *     list of sitemap links and get only those sitemap pages. This suits well for targeted
 *     crawl of specific hosts.</li>
 *  <li>For open web crawl, it is not possible to track each host and get the sitemap
 *     manually. Nutch would automatically get the sitemaps for all the hosts seen in the
 *     crawls and inject the urls from sitemap to the crawldb.</li>
 * </ol>
 * <p>For more details see:
 * </p>
* I think that the following logging line should be changed to WARN or ERROR
} catch (Exception e) {
+"Exception for url " + key.toString() + " : " + StringUtils.stringifyException(e));
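
To illustrate the logging suggestion above, here is a minimal sketch of a catch block that reports a per-URL failure at warning level and keeps the stack trace. It is not the SitemapProcessor code: the class name, `process` and `processSafely` are hypothetical, and it uses `java.util.logging` only to stay dependency-free, with `Level.WARNING` standing in for WARN.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class SitemapLoggingSketch {
  private static final Logger LOG =
      Logger.getLogger(SitemapLoggingSketch.class.getName());

  /** Hypothetical stand-in for the per-URL work; fails on an empty url. */
  static void process(String url) throws Exception {
    if (url.isEmpty()) {
      throw new IllegalArgumentException("empty url");
    }
  }

  /** Returns true on success; on failure logs at WARNING and returns false. */
  static boolean processSafely(String url) {
    try {
      process(url);
      return true;
    } catch (Exception e) {
      // Passing the exception to the logger keeps the stack trace,
      // so it does not need to be stringified by hand.
      LOG.log(Level.WARNING, "Exception for url " + url, e);
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(processSafely("http://example.com/"));
    System.out.println(processSafely(""));
  }
}
```

With a logger that accepts a throwable argument, the message and the exception can be passed separately, which avoids concatenating the stringified stack trace into the message.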

* This is merely a suggestion, but in SitemapProcessor#filterNormalize(String u), could we
not use one of the methods from instead?
      if(!u.startsWith("http://") && !u.startsWith("https://")) {
        // We received a hostname here so let's make a URL
        url = "http://" + u + "/";
        isHost = true;
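
As a rough illustration of replacing the string-prefix check with a parsed check, the sketch below uses only `java.net.URI` from the JDK. It is not the utility method referred to above; the class and method names (`HostOrUrlSketch`, `hasHttpScheme`, `toUrl`) are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class HostOrUrlSketch {

  /** True if the input parses as a URI whose scheme is http or https. */
  static boolean hasHttpScheme(String u) {
    try {
      String scheme = new URI(u).getScheme();
      return "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
    } catch (URISyntaxException e) {
      // Unparseable input is treated as "not an http(s) URL".
      return false;
    }
  }

  /** Leaves http(s) URLs untouched; otherwise treats the input as a hostname. */
  static String toUrl(String u) {
    return hasHttpScheme(u) ? u : "http://" + u + "/";
  }

  public static void main(String[] args) {
    System.out.println(toUrl("example.com"));
    System.out.println(toUrl("https://example.com/sitemap.xml"));
  }
}
```

One difference from the `startsWith` check: the scheme comparison here is case-insensitive, so an input such as `HTTP://example.com` is recognised as a URL and is not prefixed a second time.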

That's about it from me mate. This looks like an excellent addition to Nutch again. I made
a trivial update to the wiki page to drop in some links and background to your work on this

> Support sitemaps in Nutch
> -------------------------
>                 Key: NUTCH-1465
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.9
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0]
> [1]

This message was sent by Atlassian JIRA
