nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "GoogleSummerOfCode/SitemapCrawler/weeklyreport" by CihadGuzel
Date Sun, 23 Aug 2015 11:41:16 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "GoogleSummerOfCode/SitemapCrawler/weeklyreport" page has been changed by CihadGuzel:
https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler/weeklyreport?action=diff&rev1=11&rev2=12

Comment:
Weekly report has been updated

  
  
  = Week : 5 (22 June 2015 - 28 June 2015) =
- ...
  
+ '''Title :''' DbUpdater is updated 
+ 
+ DbUpdaterJob was updated for sitemap support. Each detected sitemap is written to the crawldb
as a new line and is then crawled in the next crawl cycle.
+ 
+ = Week : 6 & 7 (29 June 2015 - 12 July 2015) =
+ 
+ '''Title :''' Sitemap parse plugin was abandoned
+ 
+ The parser plugin was abandoned after consultation with the mentors. The parse process was embedded
directly instead of being delivered as a plugin. The sitemap parser is activated when the "sitemap"
parameter is given.
+ The midterm report was also prepared. Up to this stage, the sitemap life cycle has been developed
according to the outline, and the sitemap crawler runs in a basic form. The work done so far and
the work still to come have been evaluated.
+ 
+ 
+ = Week : 8 (13 July 2015 - 19 July 2015) =
+ 
+ '''Title :''' Sitemap file detection 
+ 
+ Sitemap file detection was implemented. Detection is activated by the parameters given
at fetch time.
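
The report does not say how a sitemap is recognised. One common source, sketched below purely as an illustration, is the `Sitemap:` directive that the sitemaps.org protocol allows in robots.txt. The class and method names are hypothetical and are not taken from the Nutch code base.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: collects sitemap URLs that a
// robots.txt file announces through "Sitemap:" lines.
final class RobotsSitemapDetector {

    private static final String DIRECTIVE = "sitemap:";

    static List<String> sitemapUrls(String robotsTxt) {
        List<String> urls = new ArrayList<>();
        if (robotsTxt == null) {
            return urls;
        }
        for (String line : robotsTxt.split("\\r?\\n")) {
            String trimmed = line.trim();
            // The directive name is matched case-insensitively.
            if (trimmed.toLowerCase(Locale.ROOT).startsWith(DIRECTIVE)) {
                String url = trimmed.substring(DIRECTIVE.length()).trim();
                if (!url.isEmpty()) {
                    urls.add(url);
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private\n"
                + "Sitemap: http://example.com/sitemap.xml\n";
        System.out.println(sitemapUrls(robots));
    }
}
```

A crawler could feed the returned URLs back into its URL store so that they are fetched as sitemaps in a later cycle. Whether the project's detection works this way is an assumption.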
+ 
+ = Week : 9 (20 July 2015 - 26 July 2015) =
+ 
+ '''Title :''' frequency & priority
+ 
+ Created a processSitemapParse function in ParseUtil and updated the parser process for sitemaps.
The fetch interval is now updated according to the frequency value from the sitemap.
+ A priority field was also added to the crawldb to hold the priority value from the sitemap.
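
As a rough illustration of the mapping described for this week, the sketch below converts a sitemap `changefreq` value into a fetch interval in seconds and normalises a `priority` value. The class name, method names, and the handling of "always" and "never" are assumptions made for this example and do not reflect the actual Nutch implementation. The 0.0 to 1.0 range and the 0.5 default for priority come from the sitemaps.org protocol.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch.
final class SitemapHints {

    static final int HOUR = 60 * 60;
    static final int DAY = 24 * HOUR;

    // Maps a sitemap <changefreq> value to a fetch interval in seconds.
    // Unknown or missing values fall back to the caller's default.
    static int toFetchIntervalSeconds(String changeFreq, int defaultInterval) {
        if (changeFreq == null) {
            return defaultInterval;
        }
        switch (changeFreq.trim().toLowerCase(Locale.ROOT)) {
            case "always":  return HOUR;       // assumption: treated like "hourly"
            case "hourly":  return HOUR;
            case "daily":   return DAY;
            case "weekly":  return 7 * DAY;
            case "monthly": return 30 * DAY;   // approximation of a month
            case "yearly":  return 365 * DAY;
            case "never":   return 365 * DAY;  // assumption: still revisit once a year
            default:        return defaultInterval;
        }
    }

    // Parses a sitemap <priority> value, clamped to [0.0, 1.0].
    // Missing or malformed values yield the protocol default of 0.5.
    static float parsePriority(String raw) {
        if (raw == null) {
            return 0.5f;
        }
        try {
            float value = Float.parseFloat(raw.trim());
            if (Float.isNaN(value)) {
                return 0.5f;
            }
            return Math.max(0.0f, Math.min(1.0f, value));
        } catch (NumberFormatException e) {
            return 0.5f;
        }
    }

    public static void main(String[] args) {
        System.out.println(toFetchIntervalSeconds("weekly", 30 * DAY));
        System.out.println(parsePriority("0.8"));
    }
}
```

Keeping the fallback interval as a parameter lets the caller reuse whatever default fetch interval the crawl is already configured with.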
+ 
+ 
+ = Week : 10 & 11 (27 July 2015 - 9 August 2015) =
+ 
+ '''Title :''' Review & code cleaning
+ 
+ Some improvements were made following my mentor's review, and the code was cleaned up.
Sitemap score logic was not developed, because it would affect the current Nutch score logic.
It can be done later, depending on how that is evaluated.
+ 
+ = Week : 12 (10 August 2015 - 17 August 2015) =
+ 
+ '''Title :''' Testing
+ 
+ Some problems in the Nutch test classes were fixed. Sitemap tests were written, and
documentation for the sitemap crawler was prepared.
+ 
+ = Week : 13 (18 August 2015 - 21 August 2015) =
+ 
+ '''Title :''' Final evaluation
+ 
+ The final document was prepared.
+ 
