nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "GoogleSummerOfCode/SitemapCrawler" by CihadGuzel
Date Mon, 01 Jun 2015 20:10:41 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "GoogleSummerOfCode/SitemapCrawler" page has been changed by CihadGuzel:
https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler?action=diff&rev1=6&rev2=7

  ||'''Student :'''||||Cihad Güzel - cguzelg@gmail.com||
  ||'''Mentors :'''||||[[https://wiki.apache.org/nutch/LewisJohnMcgibbney|Lewis John McGibbney]],
[[https://wiki.apache.org/nutch/talat|Talat Uyarer]]||
  
- == Abstract ==
+ === Abstract ===
  
  In the Nutch crawler, URLs can currently be discovered only from pages that have already
been crawled. This method is expensive. In addition, the importance and change frequency of
these URLs are not known, only guessed. An up-to-date sitemap file, however, can list all of
a site's URLs. For this reason, the sitemap files of a website should be crawled. This
development will add sitemap crawler support to the Nutch project.
  
- == Introduction ==
+ === Introduction ===
  
  A sitemap is a file that guides crawlers through a website more effectively. It can come in
different file formats (such as plain text, XML, RSS 2.0, and Atom 0.3 & 1.0).
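
  As a rough illustration of the XML format only (not code from Nutch or from this proposal),
the Python sketch below reads the `loc`, `changefreq`, and `priority` fields of a
sitemaps.org-style `urlset`. The sample document, the `parse_sitemap` name, and the
example.com URLs are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (version 0.9).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample sitemap, used only for illustration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://example.com/about</loc>
  </url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return one dict per <url> entry with loc, changefreq, and priority."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            # <loc> is the only required child; skip entries without it.
            continue
        changefreq = url.findtext("sm:changefreq", namespaces=SITEMAP_NS)
        priority = url.findtext("sm:priority", namespaces=SITEMAP_NS)
        entries.append({
            "loc": loc,
            # changefreq is optional, so it may be None.
            "changefreq": changefreq.strip() if changefreq else None,
            # The protocol's default priority is 0.5 when the tag is absent.
            "priority": float(priority) if priority else 0.5,
        })
    return entries


if __name__ == "__main__":
    for entry in parse_sitemap(SAMPLE):
        print(entry)
```

  A crawler could use fields like these as hints for scheduling and ranking instead of
guessing them; how Nutch would actually consume them is not specified here.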
  
@@ -23, +23 @@

   * The sitemap crawler can be monitored through reports of the errors that occur during crawling.
   * The management and configuration of the sitemap crawler are under the user's control.
  
- == Project Details: ==
+ === Project Details: ===
  
  The aim is to strengthen the Nutch project with sitemap crawler support. The main goal is to
detect sitemaps that contain correct URLs and to crawl them. A sitemap crawler makes finding
correct URLs easy and fast. The software will make the following features possible.
  
@@ -66, +66 @@

   * The current Nutch plugins can be used.
   * There is existing work on a sitemap crawler in the Nutch project (NUTCH-1741 [1], NUTCH-1465
[2]). This project will build on that work by taking its strengths and weaknesses into account.
  
- == Timeline: ==
+ === Timeline: ===
  
  The development process is divided into two stages. First, the Nutch crawler life cycle
will be updated for the sitemap crawler, and sitemaps will be crawled in a simple way before
the midterm. In the next stage, the remaining items will be completed, such as sitemap
detection, the filtering and ranking mechanism, documentation, and tests.
  
- ===== Pre-GSoC =====
- The studies and the comments on NUTCH-1741 [1] and NUTCH-1465 [2] will be followed. 
+   '''Pre-GSoC : ''' The studies and the comments on NUTCH-1741 [1] and NUTCH-1465 [2] will
be followed. 
  
    * Week1 (25May-31May): Sitemap URL injection will be done. 
    * Week2 (1June-7June): Sitemap detection will be done. FetcherJob will be updated for
  sitemaps.
@@ -87, +86 @@

    * Week12-13 (10August-23August): Further refine tests and documentation for the whole project.
  
  
- ==== Features that will be developed after GSOC: ====
+   '''Features that will be developed after GSOC:''' Sitemap crawler report page, Sitemap
monitoring page, Video Sitemaps crawler.
  
- Sitemap crawler report page,
- Sitemap monitoring page.
- Video Sitemaps crawler.
- 
- ==== Reference: ====
+ === Reference: ===
  
   *[1] https://issues.apache.org/jira/browse/NUTCH-1741
   *[2] https://issues.apache.org/jira/browse/NUTCH-1465
@@ -101, +96 @@

  
  
  
- ==== Reports ====
+ === Reports ===
   *  [[https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/week1|Week1 (25May-31May)]]
   *  [[https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/week2|Week2 (1June-7June)]]
  
- ==== Documentation ====
+ === Documentation ===
  Documents will be added here.
  
- ==== Jira Issues ====
+ === Jira Issues ===
  
   * https://issues.apache.org/jira/browse/NUTCH-1741
   * https://issues.apache.org/jira/browse/NUTCH-1465
