nutch-dev mailing list archives

From Sami Siren <>
Subject Re: svn commit: r485076 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/ test/org/apache/nutch/metadata/
Date Sat, 09 Dec 2006 23:53:05 GMT
Chris Mattmann wrote:
> Hi Sami,
> On 12/9/06 2:27 PM, "" <> wrote:
>> Author: siren
>> Date: Sat Dec  9 14:27:07 2006
>> New Revision: 485076
>> URL:
>> Log:
>> Optimize SpellCheckedMetadata further by taking into account the fact that it
>> is used only for http-headers.
>> I am starting to believe that spellchecking should just be an utility method
>> used by http protocol plugins.
> I think that right now I'm -1 for this change. I would make note of all the
> comments on NUTCH-139, from which this code was born. In the end, I think
> what we all realized was that the spell checking capability is necessary,
> but not everywhere, as you point out. However, I don't think it's limited
> entirely to HTTP headers (what you've currently changed the code to). I
> think it should be implemented as a protocol layer service, also providing
> spell checking support to other protocol plugins, like protocol-file, etc.,

In protocol-file all headers are artificial and generated in Nutch code,
so if there's a spelling mistake there, we should fix the code
generating the headers and not rely on spellchecking in the first place.

> where field headers run the risk of being misspelled as well. What's to stop
> someone from implementing protocol-file++ that returns different file header
> keys than that of protocol-file? Just b/c HTTP is the most pervasively used
> plugin right now, I think it's convenient to assume that only HTTP protocol
> field keys may need spell checking services.

If there's a real need for spell checking on other keys, one can just add
more classes to the array; no big deal.
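
To make the "add more classes to the array" idea concrete, here is a minimal
sketch. It is not the actual SpellCheckedMetadata code: the class names
(SpellCheckSketch, HttpHeaderNames, FileHeaderNames), the header constants,
and the length/3 distance threshold are all assumptions made for illustration.
The sketch collects canonical names from the public String constants of each
listed class and maps a possibly misspelled key to the closest one.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.HashMap;
import java.util.Map;

class SpellCheckSketch {

  /** Hypothetical holder of canonical HTTP header names. */
  public static class HttpHeaderNames {
    public static final String CONTENT_TYPE = "Content-Type";
    public static final String CONTENT_LENGTH = "Content-Length";
    public static final String LAST_MODIFIED = "Last-Modified";
  }

  /** Hypothetical holder of names for some other protocol plugin. */
  public static class FileHeaderNames {
    public static final String FILE_OWNER = "File-Owner";
  }

  // The array in question: supporting more keys means adding a class here.
  private static final Class<?>[] SPELL_CHECKED =
      { HttpHeaderNames.class, FileHeaderNames.class };

  // Squashed (lowercase, alphanumeric-only) name -> canonical name.
  private static final Map<String, String> CANONICAL = new HashMap<>();

  static {
    for (Class<?> c : SPELL_CHECKED) {
      for (Field f : c.getFields()) {
        int m = f.getModifiers();
        if (Modifier.isStatic(m) && Modifier.isFinal(m)
            && f.getType() == String.class) {
          try {
            String name = (String) f.get(null);
            CANONICAL.put(squash(name), name);
          } catch (IllegalAccessException e) {
            // Skip constants that cannot be read.
          }
        }
      }
    }
  }

  /** Lowercases and drops everything except letters and digits. */
  static String squash(String s) {
    StringBuilder sb = new StringBuilder();
    for (char ch : s.toCharArray()) {
      if (Character.isLetterOrDigit(ch)) {
        sb.append(Character.toLowerCase(ch));
      }
    }
    return sb.toString();
  }

  /** Plain Levenshtein edit distance using two rolling rows. */
  static int distance(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] cur = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.length(); i++) {
      cur[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                          prev[j - 1] + cost);
      }
      int[] tmp = prev;
      prev = cur;
      cur = tmp;
    }
    return prev[b.length()];
  }

  /** Returns the closest canonical name, or the input if none is close. */
  static String normalize(String name) {
    String key = squash(name);
    String exact = CANONICAL.get(key);
    if (exact != null) {
      return exact;
    }
    int threshold = key.length() / 3; // assumed tolerance, not from the source
    String best = name;
    int bestDist = threshold + 1;
    for (Map.Entry<String, String> e : CANONICAL.entrySet()) {
      int d = distance(key, e.getKey());
      if (d < bestDist) {
        bestDist = d;
        best = e.getValue();
      }
    }
    return best;
  }

  public static void main(String[] args) {
    System.out.println(normalize("Contnet-Type"));
    System.out.println(normalize("X-Custom"));
  }
}
```

In this sketch the lookup cost grows with every class added to the array,
since each unknown key is compared against all collected names. That is one
reason to keep the list limited to keys that really arrive misspelled.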

 Sami Siren
