nutch-dev mailing list archives

From Frank McCown <fmcc...@harding.edu>
Subject Support for Sitemap Protocol and Canonical URLs
Date Mon, 16 Feb 2009 17:28:55 GMT
I'm teaching a search engine course for CS undergrads, and we'd like
to make a contribution to Nutch.  It appears that Nutch does not
support the Sitemap Protocol (NUTCH-158).

http://sitemaps.org/
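
For readers who have not looked at the protocol, a sitemap is an XML file listing a site's URLs with optional hints for the crawler. The Python sketch below only illustrates the kind of parsing involved; it is not Nutch code, and the names parse_sitemap and EXAMPLE are made up for this illustration. It assumes a plain <urlset> document and ignores sitemap index files and gzip compression, both of which the protocol also allows.

```python
import xml.etree.ElementTree as ET

# Namespace used by Sitemap Protocol 0.9 documents.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text):
    """Return one dict per <url> entry in a <urlset> sitemap.

    Hypothetical helper for illustration only. Optional fields that
    are absent from an entry are returned as None.
    """
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall(SITEMAP_NS + "url"):
        loc = url.findtext(SITEMAP_NS + "loc")
        if loc is None:
            continue  # <loc> is required; skip malformed entries
        priority = url.findtext(SITEMAP_NS + "priority")
        entries.append({
            "loc": loc.strip(),
            "lastmod": url.findtext(SITEMAP_NS + "lastmod"),
            "changefreq": url.findtext(SITEMAP_NS + "changefreq"),
            "priority": float(priority) if priority is not None else None,
        })
    return entries


# A small made-up sitemap with one full entry and one minimal entry.
EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2009-02-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/about.html</loc>
  </url>
</urlset>"""

if __name__ == "__main__":
    for entry in parse_sitemap(EXAMPLE):
        print(entry["loc"], entry["lastmod"], entry["priority"])
```

A crawler would presumably feed each loc into its fetch list and could use lastmod to decide whether a page needs refetching; how that would map onto Nutch's internals is exactly what I am unsure about.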

So I wanted to check with you all and see if this is something you
think would make a good addition.  Also, do you think this would be a
good project for a team of 3 undergrad students who need to complete
it within 2-3 weeks?  Being only modestly familiar with the codebase
myself, I don't want to assign a project that would be too difficult
or overwhelming for undergraduates who are newbies and have only been
writing Java code for a few semesters.

Also, you may have heard of the new rel="canonical" link attribute, which
is now supported by Google, Yahoo, and Live:

http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html

I'd like my students to modify Nutch to support this new attribute as well.
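
To make the idea concrete: a page declares its preferred URL with a <link rel="canonical" href="..."> element, and a crawler can use that URL in place of the one it actually fetched. The Python sketch below shows one way to extract it using only the standard library. It is an illustration, not a proposal for how Nutch's parser should do it; CanonicalLinkParser and find_canonical are hypothetical names, and for simplicity it accepts the element anywhere in the page rather than only inside <head>.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <link ... />.
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, in any case.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr.get("href"):
            self.canonical = attr["href"]


def find_canonical(html, base_url):
    """Return the absolute canonical URL declared in html, or None.

    Hypothetical helper: relative hrefs are resolved against base_url,
    the URL the page was fetched from.
    """
    parser = CanonicalLinkParser()
    parser.feed(html)
    if parser.canonical is None:
        return None
    return urljoin(base_url, parser.canonical)


if __name__ == "__main__":
    page = (
        "<html><head>"
        '<link rel="canonical" href="/product.php?item=swedish-fish" />'
        "</head><body>...</body></html>"
    )
    fetched = "http://www.example.com/product.php?item=swedish-fish&sort=price"
    print(find_canonical(page, fetched))
```

A crawler could then treat the fetched URL as a duplicate of the canonical one when indexing, which is the behavior I have in mind for the students' project.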

After I get some feedback, I'll submit a request to JIRA.  I was
wondering, though: would it be better to file it as an issue against
0.9, 1.0, or 1.1?

Thanks,
Frank

-- 
Frank McCown, Ph.D.
Assistant Professor of Computer Science
Harding University
http://www.harding.edu/fmccown/
