tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (TIKA-138) Ignore HTML style and script content
Date Tue, 08 Apr 2008 15:59:26 GMT

     [ https://issues.apache.org/jira/browse/TIKA-138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-138.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.2-incubating

Resolved in revision 645982. 

> Ignore HTML style and script content
> ------------------------------------
>
>                 Key: TIKA-138
>                 URL: https://issues.apache.org/jira/browse/TIKA-138
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: julien nioche
>            Assignee: Jukka Zitting
>             Fix For: 0.2-incubating
>
>
> The current parser used for HTML leaves code in the extracted text.
> For instance, on the page http://implicitweb.blogspot.com/ the CSS section
> <style id='page-skin-1' type='text/css'><!--
> /*
> * Blogger Template Style
> *
> * Sand Dollar
> * by Jason Sutter
> * Updated by Blogger Team
> *//* Variable definitions
> ====================
> <Variable name="textcolor" description="Text Color"
> type="color" default="#000"><Variable name="bgcolor" description="Page Background Color"
> type="color" default="#f6f6f6"><Variable name="pagetitlecolor" description="Blog Title Color"
> type="color" default="#F5DEB3"><Variable name="pagetitlebgcolor" description="Blog Title Background Color"
> type="color" default="#DE7008"><Variable name="descriptionColor" description="Blog Description Color"
> type="color" default="#9E5205" /><Variable name="descbgcolor" description="Description Background Color"
> type="color" default="#F5E39e"><Variable name="titlecolor" description="Post Title Color"
> type="color" default="#9E5205"><Variable name="datecolor" description="Date Header Color"
> type="color" default="#777777"><Variable name="footercolor" description="Post Footer Color"
> ....
> is found in the extracted text. This is not the case when saving the same page as txt from Firefox or OpenOffice.
> J.
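
The behaviour requested above (dropping the contents of style and script elements from extracted text) can be sketched roughly as follows. This is only an illustration in Python using the standard-library html.parser module; it is not the change made in revision 645982 and not Tika's parser code. The names TextExtractor and extract_text are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text, skipping <style> and <script> bodies."""

    SKIPPED = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments kept for the output

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep character data only when outside style/script elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the fragments and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html):
    """Return the visible text of an HTML string (assumed helper name)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Called on a page whose head contains a style block like the one quoted above, extract_text would return only the body text, which is closer to what a browser's "save as text" produces.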

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

