nutch-dev mailing list archives

From "Doug Cook (JIRA)" <>
Subject [jira] Commented: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty
Date Tue, 16 Oct 2007 15:39:50 GMT


Doug Cook commented on NUTCH-436:

It looks like NUTCH-566 and its associated patch, which I recently filed, duplicate this issue.

The patch I proposed may or may not handle the ';' case correctly; I need to check that.

But the patch for this issue (NUTCH-436) is limited to DOMContentUtils, and the problem will
exist wherever Sun's URL class is used in URL extraction. It therefore affects every parser, not
just the HTML one: the same issue occurs in JavaScript link extraction, Flash link extraction,
etc. The fix should live in a centralized location (like util).
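
To illustrate what a centralized helper might look like, here is a minimal sketch. It is
not the code from either patch; the class name RelativeUrlSketch and the method resolve are
hypothetical. It assumes the only cases needing special treatment are embedded URLs that
begin with '?' or ';' (the empty-path cases from the issue description), and it hands
everything else to java.net.URL unchanged.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper; not part of Nutch and not taken from the attached patches. */
public class RelativeUrlSketch {

    /**
     * Resolves an embedded URL against a base URL. Embedded URLs with an empty
     * path ("?query" or ";params") are handled by hand, following RFC 1808
     * section 4, step 5. All other inputs are delegated to java.net.URL.
     */
    public static String resolve(String base, String embedded) {
        if (embedded.startsWith("?") || embedded.startsWith(";")) {
            String b = base;

            // The base fragment is never inherited.
            int hash = b.indexOf('#');
            if (hash >= 0) {
                b = b.substring(0, hash);
            }

            // The embedded URL supplies its own query (or none), so drop the base query.
            int query = b.indexOf('?');
            if (query >= 0) {
                b = b.substring(0, query);
            }

            // Embedded params replace the params on the last path segment of the base.
            if (embedded.startsWith(";")) {
                int lastSlash = b.lastIndexOf('/');
                int semi = b.indexOf(';', lastSlash + 1);
                if (semi >= 0) {
                    b = b.substring(0, semi);
                }
            }
            return b + embedded;
        }

        try {
            // Ordinary relative references: rely on the standard resolution.
            return new URL(new URL(base), embedded).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot resolve " + embedded + " against " + base, e);
        }
    }
}
```

With the base URL from the issue description, resolve("http://a/b/c/d;p?q#f", "?y") yields
http://a/b/c/d;p?y and resolve("http://a/b/c/d;p?q#f", ";x") yields http://a/b/c/d;x. Every
link extractor could call one such method instead of constructing URL objects directly.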

> Incorrect handling of relative paths when the embedded URL path is empty
> ------------------------------------------------------------------------
>                 Key: NUTCH-436
>                 URL:
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>            Reporter: Andrew Groh
>            Assignee: Dennis Kubes
>            Priority: Critical
>         Attachments: NUTCH-436-20070304.patch
> If you have a base URL of the form:
> http://a/b/c/d;p?q#f
> Embedded URL: ?y
> Correct Absolute URL: http://a/b/c/d;p?y 
> Nutch Generated URL: http://a/b/c/?y
> Embedded URL: ;x
> Correct Absolute URL: http://a/b/c/d;x 
> Nutch Generated URL: http://a/b/c/;x
> See section 4, steps 5-7 of RFC 1808 for the definition of the correct set of steps,
> and section 5.1 for examples.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
