nutch-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Support for Sitemap Protocol and Canonical URLs
Date Tue, 17 Feb 2009 07:58:03 GMT
Frank McCown wrote:
> I'm teaching a search engine course for CS undergrads, and we'd like
> to make a contribution to Nutch.  It appears that Nutch does not
> support the Sitemap Protocol (NUTCH-158).
> 
> http://sitemaps.org/

Correct.


> So I wanted to check with you all and see if this is something you
> think would make a good addition.  Also, do you think this would be a
> good project for a team of 3 undergrad students who need to complete
> it within 2-3 weeks?  Being only modestly familiar with the codebase
> myself, I don't want to assign a project that would be too difficult
> or overwhelming for undergraduates who are newbies and have only been
> writing Java code for a few semesters.

I think it would be a welcome addition. The question is more about 
whether the students are prepared to go through a few rounds of review 
and polishing the code so that it's fit for committing.

> Also you may have heard of the new rel="canonical" attribute which is
> now being supported by Google, Yahoo, and Live:
> 
> http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
> 
> I'd like my students to modify Nutch to support this new attribute as well.

This sounds like a useful addition, too.

One important note: we are in the process of re-thinking the Nutch 
architecture, so it's likely that after the 1.0 release is out the door 
we will concentrate on a heavy redesign.

For this reason it would be best if this new functionality could be well 
separated from existing classes, e.g. in utility classes, or in an 
extension point that other existing Nutch classes can use.
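
To illustrate the kind of separation meant here, below is a minimal sketch 
of a standalone utility class that pulls the <loc> entries out of a sitemap 
document. It uses only the JDK's XML parser and depends on no Nutch 
classes. The class name SitemapUtil and the method extractUrls are 
hypothetical, chosen for illustration; they are not existing Nutch APIs, 
and a real patch would still need to handle gzip, sitemap index files, and 
size limits.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of Nutch: it has no Nutch dependencies,
// so it can survive a redesign of the surrounding classes.
final class SitemapUtil {

    // Namespace defined by the Sitemap Protocol (sitemaps.org).
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    private SitemapUtil() {}

    // Returns the trimmed, non-empty text of every <loc> element, in document
    // order. Note that this includes <loc> entries of a sitemap index file.
    static List<String> extractUrls(byte[] sitemapXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Sitemaps come from untrusted hosts, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(sitemapXml));

            NodeList locs = doc.getElementsByTagNameNS(SITEMAP_NS, "loc");
            List<String> urls = new ArrayList<>();
            for (int i = 0; i < locs.getLength(); i++) {
                String url = locs.item(i).getTextContent().trim();
                if (!url.isEmpty()) {
                    urls.add(url);
                }
            }
            return urls;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Wrap checked parser errors so callers need no XML-specific handling.
            throw new IllegalArgumentException("Not a parseable sitemap document", e);
        }
    }
}
```

A caller would fetch the sitemap bytes by whatever means it already has and 
pass them to SitemapUtil.extractUrls(bytes); how the resulting URLs are fed 
into the crawl is deliberately left out of this sketch.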


> 
> After I get some feedback, I'll submit a request to JIRA.  I was
> wondering though, would it be better to submit it as an issue for 0.9,
> 1.0, or 1.1?

1.1. We are putting the final touches on 1.0, and new development will 
happen only on the trunk.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

