nutch-dev mailing list archives

From "Stefan Grroschupf (JIRA)" <>
Subject [jira] Created: (NUTCH-18) Windows servers include illegal characters in URLs
Date Sat, 26 Mar 2005 14:33:19 GMT
Windows servers include illegal characters in URLs

         Key: NUTCH-18
     Project: Nutch
        Type: Bug
  Components: fetcher  
    Reporter: Stefan Grroschupf
    Priority: Minor

Transferred from:
submitted by:
Ken Meltsner

While spidering our intranet, I found that IIS may include 
illegal characters in URLs: specifically, characters with 
the high bit set, which it uses for non-English letters. In 
addition, both Firefox and IE will accept URLs containing 
high-bit characters, but Java won't.

While this may not be Nutch's (or Java's) fault, it would 
help if high-bit characters (and other illegal characters) 
in URLs could be escaped (using percent-hex notation) 
as part of the URL fix-up process, probably right after 
the hostname lower-case conversion.
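
A minimal sketch of the kind of escaping step requested above, 
not Nutch's actual code. The class name UrlEscaper and the method 
escapeIllegal are hypothetical, and the choice of UTF-8 for the 
percent-encoded bytes is an assumption (a server may expect 
another encoding, such as ISO-8859-1). Characters allowed by 
RFC 3986, and any existing '%' escapes, are passed through 
unchanged.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of Nutch.
final class UrlEscaper {
    // ASCII characters left untouched: the RFC 3986 unreserved and
    // reserved sets, plus '%' so existing escapes are not re-escaped.
    private static final String ALLOWED =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
        + "-._~:/?#[]@!$&'()*+,;=%";

    // Percent-escapes every character that is not in ALLOWED.
    // Assumption: escaped characters are encoded as UTF-8 bytes.
    static String escapeIllegal(String url) {
        StringBuilder out = new StringBuilder(url.length());
        for (int i = 0; i < url.length(); ) {
            int cp = url.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 0x80 && ALLOWED.indexOf(cp) >= 0) {
                out.append((char) cp);
            } else {
                byte[] bytes = new String(Character.toChars(cp))
                    .getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    out.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
        }
        return out.toString();
    }
}
```

With a made-up document name (not the one from the original 
report), UrlEscaper.escapeIllegal("http://example.com/relat\u00F3rio.html") 
yields "http://example.com/relat%C3%B3rio.html".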

Example document name in Portuguese (with high-bit 
characters) taken from a longer URL:


and with percent-escaped characters:


This message is automatically generated by JIRA.
