tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2550) ToTextHandler includes <style/> element content
Date Mon, 03 Dec 2018 15:28:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707366#comment-16707366 ]

Tim Allison commented on TIKA-2550:
-----------------------------------

[~lfcnassif], yes, I was worried about breaking things, and I'm willing to revert this and find
a different solution.

I just added a unit test to confirm that script elements are still being extracted when the
HTMLParser is configured to extract them and the ToTextHandler is being used.  I also checked
legacy behavior: scripts were not coming through in the ToTextHandler for HTML files that
contain scripts, so there's no change in behavior there.

But still, this could break things. Let me know if I should revert this and either create a
new handler or otherwise fix the extraction so that we're not getting style info in the "text"
for Java source files.
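
For reference, here is a rough sketch of what the "new handler" option could look like. It is not the committed fix and not Tika code: StyleSkippingTextHandler and extractText are hypothetical names, and the sketch uses only the JDK's SAX classes so it can be run on its own against well-formed XHTML. The idea is simply to track whether we are inside a style element and drop character events while we are.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Tika): collects character data but
// drops anything nested inside a <style> element.
class StyleSkippingTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int styleDepth = 0; // greater than 0 while inside <style>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isStyle(localName, qName)) {
            styleDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isStyle(localName, qName)) {
            styleDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only keep text that is not inside a <style> element.
        if (styleDepth == 0) {
            text.append(ch, start, length);
        }
    }

    private static boolean isStyle(String localName, String qName) {
        // Fall back to the qualified name if the parser supplies no local name.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        return "style".equalsIgnoreCase(name);
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Convenience method for this sketch: parse well-formed XHTML and return its text.
    static String extractText(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            StyleSkippingTextHandler handler = new StyleSkippingTextHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }
}
```

A separate handler along these lines would leave the existing ToTextHandler behavior alone, so anyone relying on the current output would not be affected; the cost is one more handler for users to choose between.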

> ToTextHandler includes <style/> element content
> -----------------------------------------------
>
>                 Key: TIKA-2550
>                 URL: https://issues.apache.org/jira/browse/TIKA-2550
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Trivial
>             Fix For: 2.0.0, 1.20
>
>
> When using the ToTextHandler to process .java files, the <style/> element content is included, e.g.:
> {noformat}
> testFile
> code {
> color: rgb(0,0,0); font-family: monospace; font-size: 12px; white-space: nowrap;
> }
> .java_plain {
> color: rgb(0,0,0);
> }
> .java_keyword {
> color: rgb(0,0,0); font-weight: bold;
> }
> .java_javadoc_tag {
> color: rgb(147,147,147); background-color: rgb(247,247,247); font-style: italic; font-weight: bold;
> }
> h1 {
> font-family: sans-serif; font-size: 16pt; font-weight: bold; color: rgb(0,0,0); background: rgb(210,210,210); border: solid 1px black; padding: 5px; text-align: center;
> }
> .java_type {
> color: rgb(0,44,221);
> }
> .java_literal {
> color: rgb(188,0,0);
> }
> .java_javadoc_comment {
> color: rgb(147,147,147); background-color: rgb(247,247,247); font-style: italic;
> }
> .java_operator {
> color: rgb(0,124,31);
> }
> .java_separator {
> color: rgb(0,33,255);
> }
> .java_comment {
> color: rgb(147,147,147); background-color: rgb(247,247,247);
> }
> testFile/*************************************************************************
>  *  Compilation:  javac HelloWorld.java
>  *  Execution:    java HelloWorld
>  *
>  *  Prints "Hello, World". By tradition, this is everyone's first program.
>  *
>  *************************************************************************/
> public class HelloWorld {
>     public static void main(String[] args) {
>         System.out.println("Hello, World");
>     }
> }
> {noformat}
> Is this what we want as the default behavior?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
