nutch-dev mailing list archives

From "Stefan Groschupf (JIRA)" <>
Subject [jira] Updated: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)
Date Fri, 09 Dec 2005 21:14:08 GMT

Stefan Groschupf updated NUTCH-135:

    Attachment: contentProperties_patch.txt

As Doug suggested, here is a patch that uses a TreeMap with String.CASE_INSENSITIVE_ORDER to
solve the problem of case-insensitive HTTP headers, or more generally of case-insensitive
content metadata.
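
To illustrate the idea, here is a minimal sketch (not taken from the attached patch) of how a
TreeMap ordered by String.CASE_INSENSITIVE_ORDER treats differently cased header names as the
same key. The class name CaseInsensitiveHeaders and the helper newHeaderMap are hypothetical,
and the generics syntax assumes a newer Java version than Nutch targeted at the time.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class; not part of Nutch or of the attached patch.
class CaseInsensitiveHeaders {

    // Returns a map whose String keys are compared without regard to case.
    static Map<String, String> newHeaderMap() {
        return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    }

    public static void main(String[] args) {
        Map<String, String> headers = newHeaderMap();

        // A server that sends the header name in lower case.
        headers.put("content-type", "text/html");

        // The lookup succeeds regardless of the case used by the caller.
        System.out.println(headers.get("Content-Type"));

        // Storing under another spelling updates the existing entry
        // instead of creating a second one.
        headers.put("CONTENT-TYPE", "application/pdf");
        System.out.println(headers.size());
    }
}
```

A plain java.util.Properties object would treat "content-type" and "Content-Type" as two
unrelated keys, which is the behavior this issue is about.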
In general I see two different ways to solve the problem. The first is to leave the API as it
is and extend the Properties object, overriding its methods so that it uses a TreeMap behind
the scenes. This solution would also require copying data back and forth between the
Properties object and the TreeMap several times, since the Nutch code uses a Properties object
in the Content constructor. The other choice is to change the API of the Content object so
that it clearly documents that a different object, one that behaves differently from the
Properties object, is used. The downside of this solution is that it requires many small
changes across the Nutch code base.
However, I decided on the clean way, the second one, since I don't like code that does things
behind the scenes that developers would not expect. So I introduced a tiny ContentProperties
object and changed the Content constructor to use the ContentProperties object instead of the
java.util.Properties object. The new ContentProperties has an API similar to that of the
Properties class but uses case-insensitive keys. I changed all classes that use the Content
object to use the new ContentProperties up to the point of object instantiation, and I also
extended the Content test case to check that case-insensitive keys are now supported.
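
For readers without the attachment at hand, the following is a rough sketch of what such a
class could look like. It is an assumption based only on the description above, not the code
in contentProperties_patch.txt: the class name ContentPropertiesSketch and every method
signature here are hypothetical, chosen to mirror the familiar java.util.Properties accessors.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch of a Properties-like holder with case-insensitive keys.
// It deliberately does not extend java.util.Properties, so callers can see
// from the type that key lookup behaves differently.
class ContentPropertiesSketch {

    // Keys are ordered and matched without regard to case.
    private final Map<String, String> map =
        new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    // Stores a value; an existing key with different casing is overwritten.
    public void setProperty(String key, String value) {
        map.put(key, value);
    }

    // Returns the value for the key, or null if no such key is present.
    public String getProperty(String key) {
        return map.get(key);
    }

    // Returns the value for the key, or defaultValue if none is present.
    public String getProperty(String key, String defaultValue) {
        String value = map.get(key);
        return value != null ? value : defaultValue;
    }

    // Read-only view of the stored keys, in case-insensitive order.
    public Set<String> propertyNames() {
        return Collections.unmodifiableSet(map.keySet());
    }

    public int size() {
        return map.size();
    }

    public static void main(String[] args) {
        ContentPropertiesSketch metadata = new ContentPropertiesSketch();
        metadata.setProperty("content-type", "application/pdf");
        // A parser asking for the conventional spelling still gets a value.
        System.out.println(metadata.getProperty("Content-Type"));
        System.out.println(metadata.getProperty("Content-Length", "unknown"));
    }
}
```

Keeping the map private rather than inheriting from Properties matches the reasoning above:
nothing happens behind the scenes that a caller holding a Properties reference would not
expect.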
Feel free to give constructive suggestions for improvement, but please also let us get this
done as soon as possible, since from my point of view this is a critical issue. All test cases
pass on my box, but please double-check before committing.

> http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)
> ------------------------------------------------------------------------------------------------
>          Key: NUTCH-135
>          URL:
>      Project: Nutch
>         Type: Bug
>   Components: fetcher
>     Versions: 0.7.1, 0.7
>     Reporter: Stefan Groschupf
>     Priority: Critical
>      Fix For: 0.8-dev, 0.7.2-dev
>  Attachments: contentProperties_patch.txt
> As described in issue NUTCH-133, some web servers return HTTP header metadata whose keys
> do not follow the standard capitalization.
> This has many negative side effects. For example, querying the content type from the
> metadata returns null even when the web server does return a content type, if the key does
> not follow the standard spelling, e.g. it is lower case. This also affects the PDF parser,
> which queries the content length, etc.

This message is automatically generated by JIRA.
