tika-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link
Date Tue, 27 Jul 2010 22:13:17 GMT

    [ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892958#action_12892958 ]

Julien Nioche commented on TIKA-463:

I am very tempted to go one step further and delegate startElement() and endElement()
to the mappers, so that users can do whatever they like in their custom mapper implementations.
In that case we would probably no longer need mapSafeElement and mapSafeAttribute. The patch
above gives the mappers access to the metadata.

For example, <a> elements get special treatment in the HTMLHandler, and we currently can't
get the rel attribute from <a href="http://www.nutch.org" rel="nofollow">, which
for a crawler is quite an embarrassment. By delegating the logic to the mappers instead, we
would get total control over what can be done, while still keeping the existing
behaviour by default.
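
To make the idea concrete, here is a minimal sketch of what such delegation could look like.
It is only an illustration under stated assumptions: DelegatingMapper, NofollowAwareMapper and
DelegatingHandler are hypothetical names, not the current HtmlMapper API or the attached patch,
and the handler is a plain SAX DefaultHandler standing in for the real one. A list of links
stands in for the metadata.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical interface (an assumption, not the actual Tika HtmlMapper):
// the handler hands the raw element events to the mapper.
interface DelegatingMapper {
    void startElement(String name, Attributes atts, List<String> links);
    void endElement(String name);
}

// Custom mapper that collects outlinks from <a> but honours rel="nofollow".
class NofollowAwareMapper implements DelegatingMapper {
    @Override
    public void startElement(String name, Attributes atts, List<String> links) {
        if (!"a".equalsIgnoreCase(name)) {
            return; // this sketch only looks at anchors
        }
        String href = atts.getValue("href");
        String rel = atts.getValue("rel");
        if (href != null && !"nofollow".equalsIgnoreCase(rel)) {
            links.add(href);
        }
    }

    @Override
    public void endElement(String name) {
        // nothing to do for this mapper
    }
}

// Stand-in for the HTML handler: it forwards every SAX event to the mapper
// instead of deciding itself which elements and attributes are "safe".
class DelegatingHandler extends DefaultHandler {
    private final DelegatingMapper mapper;
    final List<String> links = new ArrayList<>();

    DelegatingHandler(DelegatingMapper mapper) {
        this.mapper = mapper;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        mapper.startElement(local, atts, links);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        mapper.endElement(local);
    }
}

// Small driver that feeds two <a> elements through the handler.
class MapperDemo {
    static AttributesImpl anchor(String href, String rel) {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "href", "href", "CDATA", href);
        if (rel != null) {
            atts.addAttribute("", "rel", "rel", "CDATA", rel);
        }
        return atts;
    }

    static List<String> run() {
        DelegatingHandler handler = new DelegatingHandler(new NofollowAwareMapper());
        handler.startElement("", "a", "a", anchor("http://www.nutch.org", "nofollow"));
        handler.endElement("", "a", "a");
        handler.startElement("", "a", "a", anchor("http://tika.apache.org/", null));
        handler.endElement("", "a", "a");
        return handler.links;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this sketch the nofollow link is skipped and only the second href is recorded. A default
mapper that reproduces today's behaviour could implement the same interface, so existing users
would not need to change anything.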

Is there any reason not to delegate start/endElement to the mappers? It would be good to get
some feedback on this, as I really need to improve the handling of HTML for Nutch :-)

> HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link
> ---------------------------------------------------------------------------------
>                 Key: TIKA-463
>                 URL: https://issues.apache.org/jira/browse/TIKA-463
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>         Attachments: TIKA-463.patch
> All of the listed HTML elements can have URLs as attributes, and thus we'd want to extract
> those links, if possible.
> For elements that aren't valid as XHTML 1.0, there might be some challenges in the right
> way to handle this.
> But if XHTML 1.0 means the union of "transitional and frameset" variants, then all of
> the above are valid, and thus should be emitted by the parser.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.