nutch-dev mailing list archives

From Apache Wiki <>
Subject [Nutch Wiki] Update of "GoogleSummerOfCode/SitemapCrawler/finalreport" by CihadGuzel
Date Sun, 23 Aug 2015 12:01:21 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "GoogleSummerOfCode/SitemapCrawler/finalreport" page has been changed by CihadGuzel:

final report 

New page:
= Support Sitemap Crawler in Nutch 2.x Midterm Report =

||'''Title :'''||||GSOC 2015 Midterm Report||
||'''Reporting Date :'''||||25th June 2015||
||'''Issue :'''|||| [[|NUTCH-1741 - Support Sitemap Crawler in Nutch 2.x]]||
||'''Student :'''||||Cihad Güzel -||
||'''Mentors :'''||||[[|Lewis John McGibbney]], [[|Talat Uyarer]]||
||'''Development Codebase :'''||||[[|GitHub Repo URL]]||


== Abstract ==

Currently, the Nutch crawler can discover URLs only from pages that it has already fetched.
This method is expensive. An up-to-date sitemap file, however, can list all of a site's URLs.
For this reason, the sitemap files of a website should be crawled. This development adds
sitemap crawler support to the Nutch project.

== Introduction ==
A sitemap is a file that guides crawlers through a website more effectively. It can come in
different file formats (such as plain text, XML, RSS 2.0, and Atom 0.3 & 1.0).
An up-to-date sitemap file can list all of a site's URLs, so websites can be crawled faster
by means of the sitemap crawler that will be developed. In addition, information such as the
“change frequency”, “last update time” and “priority” of the pages can be detected.
In short, this software makes it easy and fast to obtain a better URL list from a sitemap file.
Another advantage is that this process is under the control of the user.
Finally, when the project is concluded:

 * The Nutch project will have sitemap crawler support.
 * A better URL list will be obtained by filtering sitemaps according to quality criteria.
 * Unwanted sitemaps can be ignored.
 * The management and configuration of the sitemap crawler are under the control of the user.
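
The per-page metadata mentioned in the introduction (change frequency, last update time, priority) appears in an XML sitemap as optional child elements of each `<url>` entry. The following is a minimal sketch in Python, using only the standard library, of how those fields can be read. It is not the code Nutch or Crawler Commons uses; the names `parse_urlset` and `SAMPLE` and the sample URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# XML namespace used by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample sitemap; the second entry omits all optional fields.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2015-06-25</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://example.com/about</loc>
  </url>
</urlset>"""


def parse_urlset(xml_text):
    """Return one dict per <url> entry; missing optional fields become None."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", NS):
        priority = url.findtext("sm:priority", default=None, namespaces=NS)
        entries.append({
            "loc": url.findtext("sm:loc", default="", namespaces=NS).strip(),
            "lastmod": url.findtext("sm:lastmod", default=None, namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", default=None, namespaces=NS),
            "priority": float(priority) if priority is not None else None,
        })
    return entries


if __name__ == "__main__":
    for entry in parse_urlset(SAMPLE):
        print(entry)
```

A crawler could use `changefreq` and `lastmod` to decide when to refetch a page and `priority` to order the fetch queue; how Nutch maps these fields internally is not shown here.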

== Project Details ==
The aim is to strengthen the Nutch project with sitemap crawler support. The main goal is to
detect sitemaps that contain correct URLs and to crawl them. Finding correct URLs with a
sitemap crawler is easy and fast. The software will provide the following features.

 1. Sitemap detection: Sitemap files will be detected automatically, if available.
 * Sitemap list injection: Sitemap URLs will be injected by using Nutch injection.
 * The “change frequency” mechanism must be supported by the crawler.
 * Multiple sitemaps are supported.
 * Sitemap constraint: A sitemap file cannot be larger than 10 MB and cannot contain more
than 50,000 URLs.
 * Sitemaps must have only inlinks. Their outlinks must be ignored.
 * The sitemap crawler is part of the Nutch life cycle [3]. The sitemap crawler is designed
for these cases:
   * Sitemap URLs can be injected from the seed list.
   * Sitemap files can be detected automatically from crawled sites.
   * The user may want to crawl only sitemaps.
   * The user may want to crawl all URLs except sitemaps.
   * A sitemap file can reference another sitemap file.
   * A sitemap file can be in zip format.
   * A sitemap file may be larger than 50 MB. In this case, some limits must be defined.
   * A sitemap file may include more than 50,000 URLs. In this case, some limits must be defined.
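
Several of the cases above (a sitemap that references other sitemaps, a compressed sitemap, and size and URL-count limits) can be sketched together. The following Python example is an illustration only, not the Nutch implementation. The function `read_sitemap` and the constants `MAX_BYTES` and `MAX_URLS` are hypothetical names, the limit values simply mirror the 10 MB and 50,000 figures from the constraint above, and the sketch assumes gzip compression, which may differ from the "zip format" the report refers to.

```python
import gzip
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
MAX_BYTES = 10 * 1024 * 1024  # hypothetical default, mirrors the 10 MB constraint
MAX_URLS = 50_000             # hypothetical default, mirrors the 50,000 URL constraint


def read_sitemap(raw, max_bytes=MAX_BYTES, max_urls=MAX_URLS):
    """Return (kind, locs), where kind is "index" or "urlset".

    An "index" lists further sitemap files that still have to be fetched;
    a "urlset" lists page URLs.
    """
    if raw[:2] == b"\x1f\x8b":  # gzip magic number
        # A production crawler would decompress incrementally so that a
        # malicious archive cannot exhaust memory before the size check.
        raw = gzip.decompress(raw)
    if len(raw) > max_bytes:
        raise ValueError("sitemap exceeds the configured size limit")

    root = ET.fromstring(raw)
    if root.tag == NS + "sitemapindex":
        kind, child = "index", "sitemap"
    elif root.tag == NS + "urlset":
        kind, child = "urlset", "url"
    else:
        raise ValueError("not a sitemap document: " + root.tag)

    locs = []
    for entry in root.iter(NS + child):
        loc = entry.findtext(NS + "loc")
        if loc:
            locs.append(loc.strip())
        if len(locs) >= max_urls:
            break  # truncate instead of failing; this policy is a design choice
    return kind, locs
```

Here an oversized file is rejected while an overlong URL list is truncated; either policy could be swapped for the other, which is the kind of limit the last two cases say must be defined.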

=== Advantages in the project development process ===

 1. The new features can be integrated easily thanks to the Nutch plugin design and the
Nutch life cycle.
 * The current Nutch plugins can be used.
 * There is earlier work on a sitemap crawler in the Nutch project (NUTCH-1741 [1], NUTCH-1465
[2]). The process benefits from taking into account the weak and strong sides of that earlier work.

== How does Nutch 2.x process sitemaps? ==

There are two use cases supported in Nutch's sitemap processing:
 1. Sitemaps are considered "remote seed lists". Crawl administrators can prepare a list
of sitemap links and fetch only those sitemap pages. This suits targeted crawls of specific
hosts well. The sitemap URLs are directly injected, fetched and parsed if the “-sitemap”
parameter is passed. Nutch uses the Crawler Commons project for parsing sitemaps.
 2. For open web crawls, it is not possible to track each host and collect the sitemap links
manually. Nutch automatically gets the sitemaps for all hosts seen in the crawl and injects
the URLs from the sitemaps into the crawldb if the “-stmDetect” parameter is passed at fetch
time. This requires a list of all hosts seen throughout the duration of the Nutch crawl.
Nutch's HostDb stores all the hosts that were seen in a long crawl. The link to the robots.txt
of each host is generated by prepending the "http://" or "https://" scheme to the hostname.
Crawler Commons is used to parse robots.txt and thus obtain the sitemap links. These sitemap
links are then processed in the same way as in #1.
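
The detection step in use case 2 (build the robots.txt link for each host, then read the sitemap links out of it) can be sketched as follows. This is a simplified stand-in written in Python with the standard library, not the Crawler Commons parser that Nutch actually uses. The function names `robots_urls` and `sitemap_links` and the sample hosts are hypothetical.

```python
def robots_urls(host):
    """Candidate robots.txt locations for a hostname, one per scheme."""
    return [f"{scheme}://{host}/robots.txt" for scheme in ("http", "https")]


def sitemap_links(robots_txt):
    """Collect the values of all "Sitemap:" lines in a robots.txt body."""
    links = []
    for line in robots_txt.splitlines():
        # Drop trailing comments; this also drops URL fragments, which
        # are not sent to servers anyway.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        # The field name is matched case-insensitively.
        if sep and field.strip().lower() == "sitemap" and value.strip():
            links.append(value.strip())
    return links


if __name__ == "__main__":
    print(robots_urls("example.com"))
    body = (
        "User-agent: *\n"
        "Disallow: /private\n"
        "Sitemap: http://example.com/sitemap.xml\n"
    )
    print(sitemap_links(body))
```

Each link returned by `sitemap_links` would then be handled like an injected sitemap URL from use case 1.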

The sitemap crawler life cycle schema is as follows:
