nutch-dev mailing list archives

From Stefan Groschupf ...@media-style.com>
Subject Re: [jira] Commented: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)
Date Sat, 10 Dec 2005 14:41:00 GMT
Jack,
sorry, the patch is now 3 KB larger :). Please give it another
try.
Stefan


On 10.12.2005 at 15:30, Jack Tang wrote:

> Stefan
>
> It seems your patch is missing the
> org.apache.nutch.protocol.ContentProperties class, right?
>
> /Jack
>
> On 12/10/05, Stefan Groschupf (JIRA) <jira@apache.org> wrote:
>>     [ http://issues.apache.org/jira/browse/NUTCH-135?page=comments#action_12360025 ]
>>
>> Stefan Groschupf commented on NUTCH-135:
>> ----------------------------------------
>>
>> Andrzej, that is easy to add to the ContentProperties object, and
>> sure, I can do that. However, first I would love to get an OK for
>> this patch before I invest more time in it, since I spend too much
>> time writing stuff just for the issue archive.
>> As soon as this patch is in the sources I will write a small new
>> patch (as Doug suggested, do it in small steps) to solve NUTCH-3.
>>
>>> http header meta data are case insensitive in the real world  
>>> (e.g. Content-Type or content-type)
>>> ----------------------------------------------------------------
>>>
>>>          Key: NUTCH-135
>>>          URL: http://issues.apache.org/jira/browse/NUTCH-135
>>>      Project: Nutch
>>>         Type: Bug
>>>   Components: fetcher
>>>     Versions: 0.7, 0.7.1
>>>     Reporter: Stefan Groschupf
>>>     Priority: Critical
>>>      Fix For: 0.8-dev, 0.7.2-dev
>>>  Attachments: contentProperties_patch.txt
>>>
>>> As described in issue NUTCH-133, some web servers return HTTP
>>> header metadata whose key names do not use the standard
>>> capitalization. This has many negative side effects. For example,
>>> querying the content type from the metadata returns null even when
>>> the web server does return a content type, because the key is not
>>> in the standard form (e.g. it is lower case). This also affects the
>>> PDF parser, which queries the content length, etc.
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> If you think it was sent incorrectly contact one of the  
>> administrators:
>>    http://issues.apache.org/jira/secure/Administrators.jspa
>> -
>> For more information on JIRA, see:
>>    http://www.atlassian.com/software/jira
>>
>>
>
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>
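
For readers of the archive, the problem described in the quoted issue can be sketched roughly as follows. This is only an illustration of a case-insensitive header lookup, not the ContentProperties class from the patch: the class name CaseInsensitiveHeaders and the helper newHeaderMap() are hypothetical, and the sketch assumes a plain java.util.TreeMap ordered by String.CASE_INSENSITIVE_ORDER.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; this is NOT the ContentProperties
// class attached to NUTCH-135.
public class CaseInsensitiveHeaders {

    // Returns a map whose keys compare without regard to case, so that
    // "Content-Type", "content-type" and "CONTENT-TYPE" are the same key.
    static Map<String, String> newHeaderMap() {
        return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    }

    public static void main(String[] args) {
        Map<String, String> headers = newHeaderMap();

        // A server that sends header names in non-standard capitalization.
        headers.put("content-type", "application/pdf");
        headers.put("CONTENT-LENGTH", "1024");

        // Lookups with the standard spelling still find the values.
        System.out.println(headers.get("Content-Type"));
        System.out.println(headers.get("Content-Length"));

        // A header that was never sent is still reported as missing.
        System.out.println(headers.get("Last-Modified"));
    }
}
```

With an ordinary case-sensitive map, the first two lookups would return null, which is the behaviour the issue complains about.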

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net


