tika-dev mailing list archives

From "Luis Filipe Nassif (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2550) ToTextHandler includes <style/> element content
Date Mon, 03 Dec 2018 12:22:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707113#comment-16707113 ]

Luis Filipe Nassif commented on TIKA-2550:
------------------------------------------

Sorry for the late reply, [~tallison@apache.org]. Will this change the behaviour of HTML text
extraction with ToTextContentHandler? It is important to us (in the forensics field) to index
the text contained in script elements so that we can search for malicious HTML files. I think
this may not be a backward-compatible change...

But if I remember correctly, HTML script elements are handled as embedded docs? So I am not
sure whether this change will drop scripts from HTML. Could you clarify?
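
To make the question concrete, here is a minimal sketch of the behaviour I am hoping for: a text handler that drops <style/> content but still emits <script/> content. This is not Tika's actual code. The class name StyleSkippingTextHandler and the extract() helper are hypothetical, and the sketch uses only the JDK's SAX classes on a well-formed XHTML string rather than Tika's HTML parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Tika): collects character data,
// skipping anything nested inside <style> while keeping <script> text.
class StyleSkippingTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int styleDepth = 0; // > 0 while inside a <style> element

    // The JDK parser is not namespace-aware by default, so localName
    // may be empty; fall back to qName in that case.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("style".equalsIgnoreCase(name(localName, qName))) {
            styleDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("style".equalsIgnoreCase(name(localName, qName))) {
            styleDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (styleDepth == 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Convenience helper for the sketch: parse an XHTML string and return the text.
    static String extract(String xhtml) throws Exception {
        StyleSkippingTextHandler handler = new StyleSkippingTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><style>code { color: red; }</style></head>"
                + "<body><script>alert('x');</script><p>Hello</p></body></html>";
        System.out.println(extract(xhtml));
    }
}
```

With a handler along these lines, the CSS rules would disappear from the extracted text while the script body stays searchable. Whether the real fix behaves this way for script elements is exactly what I am asking.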

> ToTextHandler includes <style/> element content
> -----------------------------------------------
>
>                 Key: TIKA-2550
>                 URL: https://issues.apache.org/jira/browse/TIKA-2550
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Trivial
>             Fix For: 2.0.0, 1.20
>
>
> When using the ToTextHandler to process .java files, the <style/> element content is included, e.g.:
> {noformat}
> testFile
> code {
> color: rgb(0,0,0); font-family: monospace; font-size: 12px; white-space: nowrap;
> }
> .java_plain {
> color: rgb(0,0,0);
> }
> .java_keyword {
> color: rgb(0,0,0); font-weight: bold;
> }
> .java_javadoc_tag {
> color: rgb(147,147,147); background-color: rgb(247,247,247); font-style: italic; font-weight: bold;
> }
> h1 {
> font-family: sans-serif; font-size: 16pt; font-weight: bold; color: rgb(0,0,0); background: rgb(210,210,210); border: solid 1px black; padding: 5px; text-align: center;
> }
> .java_type {
> color: rgb(0,44,221);
> }
> .java_literal {
> color: rgb(188,0,0);
> }
> .java_javadoc_comment {
> color: rgb(147,147,147); background-color: rgb(247,247,247); font-style: italic;
> }
> .java_operator {
> color: rgb(0,124,31);
> }
> .java_separator {
> color: rgb(0,33,255);
> }
> .java_comment {
> color: rgb(147,147,147); background-color: rgb(247,247,247);
> }
> testFile/*************************************************************************
>  *  Compilation:  javac HelloWorld.java
>  *  Execution:    java HelloWorld
>  *
>  *  Prints "Hello, World". By tradition, this is everyone's first program.
>  *
>  *************************************************************************/
> public class HelloWorld {
>     public static void main(String[] args) {
>         System.out.println("Hello, World");
>     }
> }
> {noformat}
> Is this what we want as the default behavior?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
