tika-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link
Date Tue, 13 Jul 2010 11:26:49 GMT

    [ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887716#action_12887716 ]

Julien Nioche commented on TIKA-463:
------------------------------------

creating a LinksHtmlMapper : +1, that would be a nice middle ground between the default mapper
and the identity mapper

handling of links in mapper : mapSafeAttribute() returns a normalised representation of the
attribute names that are allowed but does not affect the value of the attributes. Maybe we
should change the method so that it returns BOTH the normalised name (or null if the attribute
must be skipped) and the corresponding normalised value (e.g. the resolved URL) given a name/value
pair. The mapper implementation could then manage the resolution of the URLs internally.
This would also be useful for normalising the names and values of elements in the header such
as http-equiv.
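
To make the proposal concrete, here is a minimal sketch of what such a method could look like. Everything in it is hypothetical: the MappedAttribute type, the LinkAwareMapperSketch class, the three-argument mapSafeAttribute() signature and the table of URL-carrying attributes are assumptions for illustration, not the existing HtmlMapper API. URL resolution uses plain java.net.URI.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical result type: a normalised attribute name plus its normalised value. */
final class MappedAttribute {
    final String name;
    final String value;

    MappedAttribute(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

/** Hypothetical mapper sketch; not part of the existing Tika API. */
class LinkAwareMapperSketch {
    // Element name -> the attribute that carries a URL (illustrative subset only).
    private static final Map<String, String> URL_ATTRIBUTES = new HashMap<>();
    static {
        URL_ATTRIBUTES.put("a", "href");
        URL_ATTRIBUTES.put("img", "src");
        URL_ATTRIBUTES.put("frame", "src");
        URL_ATTRIBUTES.put("iframe", "src");
        URL_ATTRIBUTES.put("area", "href");
        URL_ATTRIBUTES.put("link", "href");
    }

    private final URI base;

    LinkAwareMapperSketch(URI base) {
        this.base = base;
    }

    /**
     * Returns the normalised name and the resolved URL for a name/value pair,
     * or null if the attribute must be skipped.
     */
    MappedAttribute mapSafeAttribute(String element, String name, String value) {
        String e = element.toLowerCase(Locale.ROOT);
        String n = name.toLowerCase(Locale.ROOT);
        if (!n.equals(URL_ATTRIBUTES.get(e))) {
            return null; // not a known URL-carrying attribute for this element
        }
        try {
            return new MappedAttribute(n, base.resolve(value.trim()).toString());
        } catch (IllegalArgumentException ex) {
            return null; // malformed URL: skip rather than emit a broken link
        }
    }
}
```

With a base of http://example.com/docs/index.html, calling mapSafeAttribute("IMG", "SRC", "images/logo.png") would yield the name "src" with the URL resolved against the base, while an attribute such as onclick would come back as null.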

HtmlParser as an abstract class : what about following Jukka's suggestion for Handlers in
https://issues.apache.org/jira/browse/TIKA-458 and having a Factory?

As for frames, it raises another issue (see https://issues.apache.org/jira/browse/TIKA-457)
which is that anything outside <body> and <head> is currently discarded by the
HTMLMapper. This is why I considered doing TIKA-458 but maybe we could make the HTMLHandler
more generic and delegate the decisions to the Mappers e.g. by adding a method isBody(). 
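
A rough sketch of the isBody() idea, to show how the decision could move from the handler to the mapper. The BodyAwareMapper interface, the FramesetAwareMapper implementation and the BodyLevelTracker class are all hypothetical names; the tracker is only a simplified stand-in for the handler's body-level bookkeeping, not the actual handler code.

```java
import java.util.Locale;

/** Hypothetical mapper hook; isBody() is the method proposed above, not an existing API. */
interface BodyAwareMapper {
    boolean isBody(String elementName);
}

/** Example mapper that treats <frameset> as body-level content as well. */
class FramesetAwareMapper implements BodyAwareMapper {
    @Override
    public boolean isBody(String elementName) {
        String n = elementName.toLowerCase(Locale.ROOT);
        return n.equals("body") || n.equals("frameset");
    }
}

/** Simplified stand-in for the handler's body-level counter. */
class BodyLevelTracker {
    private final BodyAwareMapper mapper;
    private int bodyLevel = 0;

    BodyLevelTracker(BodyAwareMapper mapper) {
        this.mapper = mapper;
    }

    void startElement(String name) {
        // Once inside a body-like element, every nested element deepens the level.
        if (bodyLevel > 0 || mapper.isBody(name)) {
            bodyLevel++;
        }
    }

    void endElement(String name) {
        if (bodyLevel > 0) {
            bodyLevel--;
        }
    }

    boolean inBody() {
        return bodyLevel > 0;
    }
}
```

With this arrangement a <frame> nested in a <frameset> would be seen as body content, so it would no longer be discarded, and a different mapper could keep the stricter body-only behaviour.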

The body level is currently used to : 
a) distinguish the elements in the header
b) determine where characters should be added to the text of the document

Do we really need (a)? Are elements such as LINK, BASE or META found anywhere outside the
HEAD? Should mapSafeElement() take into account the path of an element as well e.g. to allow
a <link> only if it has <head> for parent?
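
As an illustration of the last question, a path-aware mapSafeElement() might look like the sketch below. The two-argument signature, the PathAwareMapperSketch class and the HEAD_ONLY set are assumptions made for the example; for brevity only the direct parent is passed instead of a full path, and every other element is let through unchanged, which a real mapper would not do.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical mapper sketch; the parent-aware signature is not an existing API. */
class PathAwareMapperSketch {
    // Elements that this sketch only accepts directly under <head>.
    private static final Set<String> HEAD_ONLY =
            new HashSet<>(Arrays.asList("link", "base", "meta"));

    /**
     * Returns the normalised element name, or null if the element must be skipped.
     * The parent argument may be null for the root element.
     */
    String mapSafeElement(String name, String parent) {
        String n = name.toLowerCase(Locale.ROOT);
        if (HEAD_ONLY.contains(n)) {
            return "head".equalsIgnoreCase(parent) ? n : null;
        }
        return n; // simplification: all other elements are accepted as-is
    }
}
```

Here mapSafeElement("LINK", "head") would return "link", whereas mapSafeElement("link", "body") would return null.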




> HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link
> ---------------------------------------------------------------------------------
>
>                 Key: TIKA-463
>                 URL: https://issues.apache.org/jira/browse/TIKA-463
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>
> All of the listed HTML elements can have URLs as attributes, and thus we'd want to extract
> those links, if possible.
> For elements that aren't valid as XHTML 1.0, there might be some challenges in the right
> way to handle this.
> But if XHTML 1.0 means the union of "transitional and frameset" variants, then all of
> the above are valid, and thus should be emitted by the parser.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

