nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (NUTCH-566) Sun's URL class has bug in creation of relative query URLs
Date Sat, 26 Apr 2014 21:36:16 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-566.
-----------------------------------

    Resolution: Fixed

Fixed by NUTCH-797 in version 1.4 (2.x will be patched soon). The problematic example
({{http://www.fleurie.org/entreprise.asp?id_entrep=111}}) is included in the unit test (o.a.n.util.TestURLUtil).

> Sun's URL class has bug in creation of relative query URLs
> ----------------------------------------------------------
>
>                 Key: NUTCH-566
>                 URL: https://issues.apache.org/jira/browse/NUTCH-566
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8, 0.8.1, 0.9.0
>         Environment: MacOS X and Linux (CentOS 4.5) both
>            Reporter: Doug Cook
>            Priority: Minor
>             Fix For: 1.9
>
>         Attachments: RelativeURL.java
>
>
> I'm using 0.8.1, but this will affect all other versions as well.
> Relative links of the form "?blah" are resolved incorrectly. For example, with a base URL of http://www.fleurie.org/entreprise.asp, and a relative link of "?id_entrep=111", Nutch will resolve this pair to the link "http://www.fleurie.org/?id_entrep=111". No such URL exists, and all browsers I tried will resolve the pair to "http://www.fleurie.org/entreprise.asp?id_entrep=111".
> I tracked this down to what could be called a bug in Sun's URL class. According to Sun's spec, they parse the relative URL according to RFC 2396. But the original RFC for relative links was RFC 1808, and the two RFCs differ in how they handle relative links beginning with "?". Most browsers (Netscape/Mozilla, IE, Safari) implemented RFC 1808, and stuck with it (for compatibility and also because the behavior makes more sense). Apparently even the people that wrote RFC 2396 recognized that this was a mistake, and the specified behavior was changed in RFC 3986 to match what browsers do.
> For a discussion of this, see http://gbiv.com/protocols/uri/rev-2002/issues.html#003-relative-query
> Sun's URL implementation, however, still implements RFC 2396, as far as I can tell, and is out of step with the rest of the world.
> This breaks link extraction on a number of sites.
> I implemented a simple workaround, which I'm attaching. It is a static method to create URLs which behaves exactly as new URL(URL base, String relativePath), except for relative links beginning with "?", and I use it as a drop-in replacement for that in DOMContentUtils, Javascript link extraction, etc. Obviously, it really only matters wherever links are extracted. I haven't included the calling code from DOMContentUtils, etc. because my local versions are largely rewritten, but it should be pretty obvious.
> I put it in the org.apache.nutch.net directory, but obviously feel free to move it to another place if you feel it belongs there!
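
For illustration, the kind of workaround described in the quoted report could look roughly like the sketch below. This is not the attached RelativeURL.java and not the NUTCH-797 patch; the class name RelativeUrlSketch and both method names are hypothetical. It assumes the only case that needs special handling is a relative link starting with "?", and it handles that case by prepending the last path segment of the base before delegating to new URL(URL, String).

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Illustrative sketch only; names are hypothetical, not taken from Nutch. */
public class RelativeUrlSketch {

    /**
     * Resolves target against base like new URL(base, target), except that a
     * query-only target ("?...") keeps the last path segment of the base,
     * which is what RFC 1808 / RFC 3986 and browsers do.
     */
    public static URL resolve(URL base, String target) throws MalformedURLException {
        String spec = target.trim();
        if (spec.startsWith("?")) {
            String path = base.getPath();
            // Last path segment of the base, e.g. "entreprise.asp" (may be empty).
            String lastSegment = path.substring(path.lastIndexOf('/') + 1);
            // Turn "?id=1" into "entreprise.asp?id=1" so the segment survives resolution.
            spec = lastSegment + spec;
        }
        return new URL(base, spec);
    }

    /** String-based convenience wrapper that avoids checked exceptions. */
    public static String resolveToString(String base, String target) {
        try {
            return resolve(new URL(base), target).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // prints http://www.fleurie.org/entreprise.asp?id_entrep=111
        System.out.println(
            resolveToString("http://www.fleurie.org/entreprise.asp", "?id_entrep=111"));
    }
}
```

A caller in link-extraction code would replace new URL(base, link) with RelativeUrlSketch.resolve(base, link); all targets that do not start with "?" go through the stock constructor unchanged.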



--
This message was sent by Atlassian JIRA
(v6.2#6252)