tika-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-287) HtmlParser should resolve relative paths in <a href="xxx"> elements
Date Wed, 14 Oct 2009 23:16:31 GMT

    [ https://issues.apache.org/jira/browse/TIKA-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765806#action_12765806 ]

Ken Krugler commented on TIKA-287:

[hmm, where did my comment go? Retyping]

Wish I had time to submit a patch. But the approach I used is:

1. Use incoming CONTENT_LOCATION in metadata to set up base URL.
2. Watch for <base> element in head, update base with the cleaned up href.
3. When you get an <a> element, use the cleaned up href in a call to a URL relative
4. Always trim the href you get, and strip out any CR/LF chars.
5. Attached is an example of the URL resolver code w/tests. It's not formatted properly, and
it should use a pattern with case-insensitive matching if you want to pass unnormalized URLs
to the routine.
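
Steps 1 to 4 above could be sketched roughly as follows. This is only an illustration, not the attached UrlUtils.java: the class and method names (HrefResolverSketch, onBaseElement, onAnchorElement) are hypothetical, the content location is passed in as a plain string rather than read from Tika metadata, and the query-string quirk mentioned in the issue description is not handled here.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper; names are illustrative and not taken from Tika. */
final class HrefResolverSketch {
    private URL base;

    /** Step 1: start from the document location supplied by the caller. */
    HrefResolverSketch(String contentLocation) {
        this.base = parse(null, clean(contentLocation));
    }

    /** Step 4: trim the raw attribute value and drop embedded CR/LF chars. */
    static String clean(String href) {
        return href.trim().replaceAll("[\\r\\n]", "");
    }

    /** Step 2: a <base href="..."> element replaces the current base URL. */
    void onBaseElement(String href) {
        base = parse(base, clean(href));
    }

    /** Step 3: resolve an <a href="..."> value against the current base. */
    String onAnchorElement(String href) {
        return parse(base, clean(href)).toString();
    }

    // Wraps the checked exception so callers can stay simple in this sketch.
    private static URL parse(URL context, String spec) {
        try {
            return new URL(context, spec);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad URL: " + spec, e);
        }
    }
}
```

A parser handler would create one instance per document, call onBaseElement when it sees <base> in the head, and call onAnchorElement for each <a> element.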

Hope this helps...Ken

> HtmlParser should resolve relative paths in <a href="xxx"> elements
> -------------------------------------------------------------------
>                 Key: TIKA-287
>                 URL: https://issues.apache.org/jira/browse/TIKA-287
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.4
>            Reporter: Ken Krugler
>            Assignee: Jukka Zitting
>         Attachments: UrlUtils.java, UrlUtilsTest.java
> Currently clients of the HtmlParser need to manually keep track of the appropriate base
> URL to use when resolving relative URLs in href="xxx" attributes.
> The parser should use the metadata RESOURCE_NAME_KEY value as the base.
> The parser should also watch for a <base> element in the <head> section,
> and use that to update the base URL.
> Note that special care must be taken to work around a known bug in the Java URL() class,
> when the relative URL is a query string and the base URL doesn't end with a '/'.
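
One possible workaround for the query-string case is sketched below. It is not necessarily what the attached UrlUtils.java does, and the class name QueryOnlyResolveSketch is hypothetical. The idea is to prepend the last path segment of the base to a relative URL that starts with '?', so that java.net.URL never sees a query-only relative URL.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper; not part of Tika. */
final class QueryOnlyResolveSketch {
    /** Resolves relative against baseUrl, keeping the base's last path segment for "?..." inputs. */
    static String resolve(String baseUrl, String relative) {
        try {
            URL base = new URL(baseUrl);
            if (relative.startsWith("?")) {
                // Re-attach the file part of the base path, e.g. "page.html" + "?id=7",
                // so the resolver treats the input as an ordinary relative path.
                String path = base.getPath();
                String lastSegment = path.substring(path.lastIndexOf('/') + 1);
                relative = lastSegment + relative;
            }
            return new URL(base, relative).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad URL: " + baseUrl + " / " + relative, e);
        }
    }
}
```

With this sketch, resolving "?id=7" against "http://example.com/dir/page.html" keeps "page.html" in the result instead of dropping it.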

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
