nutch-dev mailing list archives

From "David Escuer (JIRA)" <>
Subject [jira] [Commented] (NUTCH-18) Windows servers include illegal characters in URLs
Date Fri, 01 Apr 2011 14:31:05 GMT


David Escuer commented on NUTCH-18:

The person you are trying to reach will be out of the SIMPPLE offices
from March 30 until April 7 (both dates included).

> Windows servers include illegal characters in URLs
> --------------------------------------------------
>                 Key: NUTCH-18
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>            Reporter: Stefan Groschupf
>            Priority: Minor
> Transferred from:
> submitted by:
> Ken Meltsner
> While spidering our intranet, I found that IIS may include 
> illegal characters in URLs -- specifically, characters with 
> the high bit set to produce non-English letters. In 
> addition, both Firefox and IE will accept URLs with 
> high-bit characters, but Java won't.
> While this may not be Nutch's (or Java's) fault, it would 
> help if high-bit characters (and other illegal characters) 
> in URLs could be escaped (using percent-hex notation) 
> as part of the URL fix-up process, probably right after 
> the hostname lower-case conversion.
> Example document name in Portuguese (with high-bit 
> characters) taken from a longer URL:
> Nota%20tecnica%20-%20Alteração%20de%20escopo.doc
> and with percent-escaped characters:
> Nota%20tecnica%20-%20Altera%e7%e3o%20de%20escopo.doc
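
The escaping requested in the issue can be illustrated with a minimal sketch. This is not Nutch code: the function name `escape_high_bit` is hypothetical, and the default ISO-8859-1 encoding is an assumption inferred from the `%e7%e3` bytes in the reporter's example (UTF-8 would produce different, longer escapes). Printable ASCII, including any existing `%XX` escapes, is passed through unchanged; everything else is hex-escaped byte by byte.

```python
def escape_high_bit(url: str, encoding: str = "iso-8859-1") -> str:
    """Percent-escape characters outside printable ASCII (hypothetical helper).

    Existing escapes such as %20 are left alone because '%' and hex digits
    are printable ASCII. The encoding is an assumption; a character that the
    chosen encoding cannot represent raises UnicodeEncodeError.
    """
    out = []
    for ch in url:
        if 0x21 <= ord(ch) <= 0x7E:
            # Printable ASCII: copy as-is.
            out.append(ch)
        else:
            # Space, control, or high-bit character: escape each encoded byte.
            out.extend(f"%{b:02x}" for b in ch.encode(encoding))
    return "".join(out)


name = "Nota%20tecnica%20-%20Alteração%20de%20escopo.doc"
print(escape_high_bit(name))
```

Under the ISO-8859-1 assumption this reproduces the percent-escaped form quoted in the issue. A server that emits UTF-8 bytes would need `encoding="utf-8"` instead, so the right choice depends on what the originating server expects.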

This message is automatically generated by JIRA.
For more information on JIRA, see:
