nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1685) URLUtil.toUNICODE fails on IDNs
Date Mon, 23 Dec 2013 09:26:50 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855518#comment-13855518 ]

Markus Jelsma commented on NUTCH-1685:
--------------------------------------

Looks like a duplicate.

> URLUtil.toUNICODE fails on IDNs
> -------------------------------
>
>                 Key: NUTCH-1685
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1685
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.7, 2.2.1
>         Environment: Java 7, OpenJDK 64-Bit, 1.7.0_25
>            Reporter: Sebastian Nagel
>             Fix For: 2.3, 1.8
>
>         Attachments: NUTCH-1685-2x-test.patch
>
>
> URLUtil.toUNICODE() fails on IDNs and returns null instead of the Unicode URL. The
> constructor of URI does not accept IDN (non-ASCII) host names. For
> {{http://www.xn--evir-zoa.com/}} the URI constructor throws the exception:
> {code}
> java.net.URISyntaxException: Illegal character in hostname at index 11: http://www.çevir.com/
> {code}
> In principle, IDN.toUnicode() can convert whole URLs, not only domain or host names.
> However, it does not convert URLs whose host consists of only two parts, such as
> {{http://xn--uni-tbingen-xhb.de/}}. Is that the reason why we need URLUtil.toUNICODE()?
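
For illustration only, here is a minimal sketch of one way to sidestep the
URI constructor: convert just the host with java.net.IDN.toUnicode() and
reassemble the URL as a plain string. The class name IdnUrlSketch and the
helper toUnicodeUrl are hypothetical and are not taken from Nutch or from the
attached patch. Returning null for a malformed URL is an assumption that
mirrors the null result described above. The sketch passes only the host to
IDN.toUnicode(), so the scheme prefix never becomes part of the first label,
which may be why the two-part host case above is left unconverted when the
whole URL is passed in.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch, not the Nutch implementation.
class IdnUrlSketch {

    /**
     * Returns the URL with its host converted from Punycode to Unicode,
     * or null if the input is not a well-formed URL (assumed behaviour).
     */
    public static String toUnicodeUrl(String url) {
        URL u;
        try {
            u = new URL(url);
        } catch (MalformedURLException e) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        sb.append(u.getProtocol()).append("://");
        if (u.getUserInfo() != null) {
            sb.append(u.getUserInfo()).append('@');
        }
        // Convert only the host; IDN.toUnicode() returns its input
        // unchanged when there is nothing to convert.
        sb.append(IDN.toUnicode(u.getHost()));
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        sb.append(u.getFile()); // path plus query string, if any
        if (u.getRef() != null) {
            sb.append('#').append(u.getRef());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two URLs quoted in the issue description.
        System.out.println(toUnicodeUrl("http://www.xn--evir-zoa.com/"));
        System.out.println(toUnicodeUrl("http://xn--uni-tbingen-xhb.de/"));
    }
}
```

Because the result is built with a StringBuilder and never passed back
through the URI constructor, the non-ASCII host cannot trigger the
URISyntaxException shown above. The trade-off is that the returned string is
not validated as a URI.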



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
