Windows servers include illegal characters in URLs
--------------------------------------------------
Key: NUTCH-18
URL: http://issues.apache.org/jira/browse/NUTCH-18
Project: Nutch
Type: Bug
Components: fetcher
Reporter: Stefan Groschupf
Priority: Minor
Transferred from:
http://sourceforge.net/tracker/index.php?func=detail&aid=1110243&group_id=59548&atid=491356
submitted by:
Ken Meltsner
While spidering our intranet, I found that IIS may include
illegal characters in URLs -- specifically, characters with
the high bit set to produce non-English letters. In
addition, both Firefox and IE will accept URLs with high-
bit characters, but Java won't.
While this may not be Nutch's (or Java's) fault, it would
help if high-bit characters (and other illegal characters)
in URLs could be escaped (using percent-hex notation)
as part of the URL fix-up process, probably right after
the hostname lower-case conversion.
Example document name in Portuguese (with high-bit
characters), taken from a longer URL:
Nota%20tecnica%20-%20Alteração%20de%20escopo.doc
and with percent-escaped characters:
Nota%20tecnica%20-%20Altera%e7%e3o%20de%20escopo.doc
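
A minimal sketch of the kind of escaping requested above, not Nutch's
actual URL fix-up code. The class name UrlEscaper and the method
escapeIllegal are hypothetical. The sketch assumes that only printable
ASCII (0x21-0x7e) is passed through unchanged and that everything else
is percent-escaped byte by byte in a caller-supplied charset; a real
normalizer would also need to handle illegal printable characters. The
escaped example above (%e7, %e3) corresponds to ISO-8859-1 bytes, so
that charset is used here; UTF-8 would yield %c3%a7%c3%a3 instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Nutch.
final class UrlEscaper {

    // Percent-escapes every character outside printable ASCII (0x21-0x7e).
    // Existing escapes such as %20 are left alone because '%', digits and
    // letters are all printable ASCII.
    public static String escapeIllegal(String url, Charset charset) {
        StringBuilder out = new StringBuilder(url.length() + 16);
        int i = 0;
        while (i < url.length()) {
            int cp = url.codePointAt(i);
            int len = Character.charCount(cp);
            if (cp > 0x20 && cp < 0x7f) {
                out.appendCodePoint(cp);
            } else {
                // Encode the character in the assumed charset, then write
                // each resulting byte as %xx in lower-case hex.
                byte[] bytes = url.substring(i, i + len).getBytes(charset);
                for (byte b : bytes) {
                    out.append('%');
                    out.append(Character.forDigit((b >> 4) & 0xf, 16));
                    out.append(Character.forDigit(b & 0xf, 16));
                }
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "Nota%20tecnica%20-%20Altera\u00e7\u00e3o%20de%20escopo.doc";
        System.out.println(escapeIllegal(raw, StandardCharsets.ISO_8859_1));
    }
}
```

Which charset the server expects is an open question that this sketch
leaves to the caller: IIS installations of that era often used a
Latin-1 style code page, while current practice favors UTF-8.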