uima-dev mailing list archives

From "Dale Lane (JIRA)" <...@uima.apache.org>
Subject [jira] [Created] (UIMA-5147) RUTA leaves the contents of STYLE tags in plaintext
Date Wed, 19 Oct 2016 12:18:58 GMT
Dale Lane created UIMA-5147:

             Summary: RUTA leaves the contents of STYLE tags in plaintext
                 Key: UIMA-5147
                 URL: https://issues.apache.org/jira/browse/UIMA-5147
             Project: UIMA
          Issue Type: Bug
          Components: Ruta
            Reporter: Dale Lane

I'm using the RUTA HtmlAnnotator and HtmlConverter to turn an HTML document into the plain text
extracted from it, with annotations to represent the markup that was in the original HTML.

The contents of <STYLE> tags are showing up in the plaintext view, which isn't helpful.
As the CSS inside STYLE isn't part of the document contents, I think it'd be better for it not
to be added to the plaintext, or at least for there to be an option to exclude it.

(Apologies if I've missed a way to do this using the existing options)

As a simple way to recreate this, a document like this can be used:
<html><head>
    <style>
        /*  */
        .test {
            text-align: left;
        }
    </style>
</head><body>Hello world</body></html>
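
Until there is an option for this, one possible workaround is to remove the STYLE elements from the HTML before it reaches the HtmlAnnotator. The sketch below is only an illustration of that idea: StyleStripper and stripStyleBlocks are hypothetical names, not part of RUTA, and it uses a plain regular expression rather than a real HTML parser, so it assumes well-formed <style>...</style> pairs.

```java
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of RUTA: drops <style> elements from raw HTML
// so that their CSS never reaches the plaintext conversion step.
final class StyleStripper {

    // Matches an opening <style ...> tag, everything inside it, and the closing tag.
    // CASE_INSENSITIVE covers <STYLE>; DOTALL lets the CSS span several lines.
    private static final Pattern STYLE_BLOCK = Pattern.compile(
            "<style\\b[^>]*>.*?</style\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private StyleStripper() {
    }

    // Returns the input with every <style> element (tags and contents) removed.
    static String stripStyleBlocks(String html) {
        return STYLE_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<html><head>\n"
                + "<style>\n.test {\n    text-align: left;\n}\n</style>\n"
                + "</head><body>Hello world</body></html>";
        // Prints the document without the style element.
        System.out.println(stripStyleBlocks(html));
    }
}
```

Note that removing text shifts the offsets of everything after it, so annotations created afterwards refer to the stripped document rather than the original HTML. If the original offsets matter, replacing the matched span with whitespace of the same length would be an alternative.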
