tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Updated: (TIKA-138) Ignore HTML style and script content
Date Tue, 08 Apr 2008 15:55:27 GMT

     [ https://issues.apache.org/jira/browse/TIKA-138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-138:

    Assignee: Jukka Zitting
     Summary: Ignore HTML style and script content  (was: Better HTML parsing)

Good point. As discussed recently on the mailing list, there are probably some cases where
style and script content is useful for a Tika client, but by default the extracted text should
match what is normally shown by a browser.
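
As a rough illustration of that default, here is a minimal sketch that drops
character data found inside style and script elements. It is not Tika's code:
it uses Python's standard html.parser module purely for demonstration, and the
names VisibleTextExtractor, extract_text and keep_style_and_script are
hypothetical. The flag stands in for the cases where a client does want the
style and script content.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects character data, skipping text inside <style> and <script>.

    Hypothetical helper for illustration only; not part of Tika.
    """

    SKIPPED = {"style", "script"}

    def __init__(self, keep_style_and_script=False):
        super().__init__(convert_charrefs=True)
        self.keep_style_and_script = keep_style_and_script
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED and not self.keep_style_and_script:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside any skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace so the result reads like plain text.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html, keep_style_and_script=False):
    parser = VisibleTextExtractor(keep_style_and_script)
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample page, loosely modelled on the kind of markup in the report.
sample = """<html><head>
<style type='text/css'><!--
body { color: #000; }
--></style>
<script>var x = 1;</script>
</head><body><p>Hello, world.</p></body></html>"""

print(extract_text(sample))
print(extract_text(sample, keep_style_and_script=True))
```

With the flag left at its default only the paragraph text is returned; passing
keep_style_and_script=True also returns the CSS and JavaScript source, which
is the opt-in behaviour a client might ask for.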

> Ignore HTML style and script content
> ------------------------------------
>                 Key: TIKA-138
>                 URL: https://issues.apache.org/jira/browse/TIKA-138
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: julien nioche
>            Assignee: Jukka Zitting
> The current parser used for HTML leaves code in the extracted text. 
> For instance in the page http://implicitweb.blogspot.com/ the CSS section
> <style id='page-skin-1' type='text/css'><!--
> /*
> * Blogger Template Style
> *
> * Sand Dollar
> * by Jason Sutter
> * Updated by Blogger Team
> *//* Variable definitions
> ====================
> <Variable name="textcolor" description="Text Color"
> type="color" default="#000"><Variable name="bgcolor" description="Page Background
> type="color" default="#f6f6f6"><Variable name="pagetitlecolor" description="Blog Title Color"
> type="color" default="#F5DEB3"><Variable name="pagetitlebgcolor" description="Blog Title Background Color"
> type="color" default="#DE7008"><Variable name="descriptionColor" description="Blog Description Color"
> type="color" default="#9E5205" /><Variable name="descbgcolor" description="Description Background Color"
> type="color" default="#F5E39e"><Variable name="titlecolor" description="Post Title
> type="color" default="#9E5205"><Variable name="datecolor" description="Date Header
> type="color" default="#777777"><Variable name="footercolor" description="Post Footer
> ....
> is found in the extracted text. This is not the case when saving the same page as txt
> from Firefox or OpenOffice.
> J.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
