nutch-dev mailing list archives

From Apache Wiki <>
Subject [Nutch Wiki] Trivial Update of "GoogleSummerOfCode/SitemapCrawler" by LewisJohnMcgibbney
Date Wed, 20 May 2015 22:49:45 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "GoogleSummerOfCode/SitemapCrawler" page has been changed by LewisJohnMcgibbney:

+ <<TableOfContents(4)>>
  == Abstract ==
  In the Nutch crawler, URLs can only be discovered from pages that have already been scanned.
This method is expensive. Also, the importance and change frequency of these
URLs are not known, only guessed. However, all of a site's URLs can be found in an up-to-date
sitemap file. For this reason, the sitemap files of a website should be crawled. With this
development, the Nutch project will gain sitemap crawler support.
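
To illustrate what a sitemap file provides that link discovery does not, here is a minimal sketch in Python that reads the `loc`, `changefreq` and `priority` fields defined by the sitemaps.org protocol. It is not the Nutch implementation: the function name `parse_sitemap` and the sample document are hypothetical, made up for this example. The only fallback applied is the protocol's default priority of 0.5.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol (version 0.9).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, invented for illustration only.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/about.html</loc>
  </url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return one dict per <url> entry with loc, changefreq and priority."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        if loc is None:
            # <loc> is mandatory in the protocol; skip malformed entries.
            continue
        # <changefreq> is optional and has no default, so it may be None.
        changefreq = url.findtext("sm:changefreq", namespaces=SITEMAP_NS)
        # <priority> is optional; the protocol's default value is 0.5.
        priority = float(
            url.findtext("sm:priority", default="0.5", namespaces=SITEMAP_NS)
        )
        entries.append(
            {"loc": loc.strip(), "changefreq": changefreq, "priority": priority}
        )
    return entries


if __name__ == "__main__":
    for entry in parse_sitemap(EXAMPLE_SITEMAP):
        print(entry)
```

Each entry carries the change frequency and priority declared by the site owner, which a crawler would otherwise have to guess.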