nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (NUTCH-1465) Support sitemaps in Nutch
Date Thu, 23 Jan 2014 14:39:37 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13879955#comment-13879955 ]

Lewis John McGibbney edited comment on NUTCH-1465 at 1/23/14 2:38 PM:
----------------------------------------------------------------------

Hey [~tejasp]. Again, great work! Some minor comments:

* The class-level Javadoc in SitemapProcessor would be more legible if it used a format
similar to this:
{code:title=SitemapProcessor.java|borderStyle=solid}
/**
 * <p>Performs Sitemap processing by fetching sitemap links, parsing the content and
 * merging the urls from Sitemap (with the metadata) with the existing crawldb.</p>
 *
 * <p>There are two use cases supported in Nutch's Sitemap processing:</p>
 * <ol>
 *  <li>Sitemaps are considered as "remote seed lists". Crawl administrators can prepare
 *     a list of sitemap links and get only those sitemap pages. This suits well for
 *     targeted crawl of specific hosts.</li>
 *  <li>For open web crawl, it is not possible to track each host and get the sitemap
 *     links manually. Nutch would automatically get the sitemaps for all the hosts seen
 *     in the crawls and inject the urls from sitemap to the crawldb.</li>
 * </ol>
 * <p>For more details see:
 *      https://wiki.apache.org/nutch/SitemapFeature </p>
 */
{code}
* I think the following logging line should be changed to WARN or ERROR, since it reports an exception:
{code:title=SitemapProcessor.java|borderStyle=solid}
} catch (Exception e) {
+          LOG.info("Exception for url " + key.toString() + " : " + StringUtils.stringifyException(e));

{code}
* This is merely a suggestion, but in SitemapProcessor#filterNormalize(String u), could we
not use one of the methods from URLUtil.java instead?
{code:title=SitemapProcessor.java|borderStyle=solid}
      if(!u.startsWith("http://") && !u.startsWith("https://")) {
        // We received a hostname here so let's make a URL
        url = "http://" + u + "/";
        isHost = true;
      }
{code}
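
As a rough illustration of that last point, the sketch below lets java.net.URI assemble and validate the URL instead of concatenating strings by hand. It deliberately avoids guessing at URLUtil's method signatures: the class HostToUrl and the method toUrl are hypothetical names, not part of the patch or of Nutch, and it assumes the input is either a full http(s) URL or a bare hostname without a port.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of Nutch or the patch. */
final class HostToUrl {

    private HostToUrl() {}

    /**
     * Returns the input unchanged (apart from trimming) if it already has an
     * http or https scheme; otherwise treats it as a bare hostname and builds
     * "http://<host>/" from it.
     */
    static String toUrl(String u) {
        String trimmed = u.trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("Empty input");
        }
        String lower = trimmed.toLowerCase(Locale.ROOT);
        if (lower.startsWith("http://") || lower.startsWith("https://")) {
            return trimmed;
        }
        try {
            // Assumption: the input is a bare hostname with no port.
            // java.net.URI rejects hosts containing illegal characters.
            return new URI("http", trimmed, "/", null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid hostname: " + u, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toUrl("example.com"));
        System.out.println(toUrl("https://example.com/sitemap.xml"));
    }
}
```

The original snippet also sets isHost = true in the hostname branch; this sketch omits that flag, so a real change would still need to carry it through.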

That's about it from me, mate. This looks like an excellent addition to Nutch again. I made
a trivial update to the wiki page to drop in some links and background to your work on this
one.

I should probably add that this works fine for me in local tests, e.g. injecting from a
sitemap file and from Hostdb.


> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.9
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch,
>                      NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and
> appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
