nutch-dev mailing list archives

From Cihad Guzel <cguz...@gmail.com>
Subject GSoC - Sitemap support - final evaluation
Date Thu, 13 Aug 2015 20:03:57 GMT
Hi all.

As you know, I am working on NUTCH-1741 for GSoC 2015. There is very little
time left before the final evaluation of the GSoC program, so I want to
briefly summarize the progress.

My goal is to add sitemap support to the project. I have almost completed my
work. I have committed my code to my GitHub account[1] and attached the patch
file to the issue[2]. The features developed at this stage are as follows:

+ Sitemap files are crawled (inject, generate, fetch and parse).
+ If a host has any sitemap files, they are detected.
+ If desired, only sitemaps can be crawled, or only other (non-sitemap) URLs
can be crawled.
+ The feature is activated with just one parameter (-sitemap).

Please see the wiki[3] and the issue[2] for more information.

Thanks to my mentors (Lewis & Talat) and to the Nutch community.

[1] - https://github.com/cguzel/nutch-sitemapCrawler
[2] - https://issues.apache.org/jira/browse/NUTCH-1741
[3] - https://wiki.apache.org/nutch/GoogleSummerOfCode/SitemapCrawler

--
Kind regards
Cihad Guzel
