From dev-return-3669-apmail-tika-dev-archive=tika.apache.org@tika.apache.org Tue Jul 13 11:27:18 2010
Message-ID: <5487191.345761279020409727.JavaMail.jira@thor>
Date: Tue, 13 Jul 2010 07:26:49 -0400 (EDT)
From: "Julien Nioche (JIRA)"
To: dev@tika.apache.org
Reply-To: dev@tika.apache.org
Subject: [jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link
In-Reply-To: <9090641.332081278965749635.JavaMail.jira@thor>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

[
https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887716#action_12887716 ]

Julien Nioche commented on TIKA-463:
------------------------------------

Creating a LinksHtmlMapper: +1, that would be a nice intermediate between the default mapper and the identity mapper.

Handling of links in the mapper: mapSafeAttribute() returns a normalised representation of the attribute names that are allowed, but does not affect the values of the attributes. Maybe we should change the method so that, given a name/value pair, it returns BOTH the normalised name (or null if the attribute must be skipped) and the corresponding normalised value (e.g. the resolved URL). The mapper implementation could then manage the resolution of the URLs internally. This would also be useful for normalising the names and values of elements in the header, such as http-equiv.

HtmlParser as an abstract class: what about following Jukka's suggestion for Handlers in https://issues.apache.org/jira/browse/TIKA-458 and having a Factory?

As for frames, they raise another issue (see https://issues.apache.org/jira/browse/TIKA-457), which is that anything outside the body element is currently discarded by the HTMLMapper. This is why I considered doing TIKA-458, but maybe we could make the HTMLHandler more generic and delegate the decisions to the Mappers, e.g. by adding a method isBody(). The body level is currently used to:
a) distinguish the elements in the header
b) determine where characters should be added to the text of the document

Do we really need (a)? Are elements such as LINK, BASE or META found anywhere outside the HEAD? Should mapSafeElement() also take into account the path of an element, e.g. to allow an element only if it has a particular parent?
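
To make the mapSafeAttribute() suggestion above a bit more concrete, here is a rough sketch of what a name-and-value mapping could look like. Everything in it is hypothetical: the MappedAttribute holder, the AttributeMapper interface, the three-argument mapSafeAttribute() signature and the LinkResolvingMapper class are made up for illustration and are not Tika's existing API. It only assumes that the mapper is given a base URI to resolve against.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class LinkAttributeMapperSketch {

    /** Hypothetical holder for a normalised attribute name and value. */
    public static final class MappedAttribute {
        public final String name;
        public final String value;

        public MappedAttribute(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    /** Hypothetical contract: returns null when the attribute must be skipped. */
    public interface AttributeMapper {
        MappedAttribute mapSafeAttribute(String elementName, String attributeName, String value);
    }

    /** Keeps a small, illustrative set of URL attributes and resolves their values. */
    public static final class LinkResolvingMapper implements AttributeMapper {
        // Assumption: only these two attributes carry URLs in this sketch.
        private static final Set<String> URL_ATTRIBUTES =
                new HashSet<>(Arrays.asList("href", "src"));

        private final URI base;

        public LinkResolvingMapper(URI base) {
            this.base = base;
        }

        @Override
        public MappedAttribute mapSafeAttribute(String elementName, String attributeName, String value) {
            String name = attributeName.toLowerCase(Locale.ROOT);
            if (!URL_ATTRIBUTES.contains(name)) {
                return null; // attribute is not allowed, skip it
            }
            try {
                // Resolve the (possibly relative) URL against the document base.
                return new MappedAttribute(name, base.resolve(value.trim()).toString());
            } catch (IllegalArgumentException e) {
                // Malformed URL: keep the raw value rather than dropping the link.
                return new MappedAttribute(name, value);
            }
        }
    }

    public static void main(String[] args) {
        AttributeMapper mapper =
                new LinkResolvingMapper(URI.create("http://example.com/docs/index.html"));
        MappedAttribute mapped = mapper.mapSafeAttribute("img", "SRC", "../images/logo.png");
        System.out.println(mapped.name + " = " + mapped.value);
        System.out.println(mapper.mapSafeAttribute("img", "onclick", "run()"));
    }
}
```

Returning null for the whole pair keeps the "skip this attribute" convention of the current method, while the caller no longer needs to know which attributes hold URLs.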

> HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link
> ---------------------------------------------------------------------------------
>
>                 Key: TIKA-463
>                 URL: https://issues.apache.org/jira/browse/TIKA-463
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>
> All of the listed HTML elements can have URLs as attributes, and thus we'd want to extract those links, if possible.
> For elements that aren't valid as XHTML 1.0, there might be some challenges in the right way to handle this.
> But if XHTML 1.0 means the union of "transitional and frameset" variants, then all of the above are valid, and thus should be emitted by the parser,

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.