tika-dev mailing list archives

From "Ken Krugler (Jira)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2928) Less than sign within tag boundaries considered as start of a new tag.
Date Thu, 22 Aug 2019 14:43:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913382#comment-16913382 ]

Ken Krugler commented on TIKA-2928:

The issue isn't that this is "somewhat non-standard" HTML; it's broken HTML, since a literal
'<' character in text content needs to be encoded as &lt;.

Browsers are pretty good at detecting this situation and working around it. Tika uses TagSoup
under the hood to process (often broken) HTML, so it sounds like this is a limitation of that
library. It would be interesting if you could see how JSoup handles this same document, as
there's a [pending issue|https://issues.apache.org/jira/browse/TIKA-1599] to switch from TagSoup
to that library.
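
For anyone who needs a stopgap, the point above about encoding '<' as &lt; can be approximated by pre-processing the markup before it reaches the parser. The following is only a rough sketch, not anything Tika or TagSoup provides: the class name BareLessThanEscaper and the rule it applies (a '<' counts as markup only when followed by a letter, '/', '!' or '?') are assumptions made for illustration. The heuristic will miss text such as "a<b", and it should not be run over script, comment, or CDATA content.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika or TagSoup.
final class BareLessThanEscaper {

    // Assumption: a '<' starts markup only if the next character is
    // a letter, '/', '!' or '?'. Any other '<' is treated as text.
    private static final Pattern BARE_LT = Pattern.compile("<(?![A-Za-z/!?])");

    // Replace every '<' that does not look like the start of a tag with "&lt;".
    static String escape(String html) {
        return BARE_LT.matcher(html).replaceAll("&lt;");
    }

    public static void main(String[] args) {
        // First row of the snippet from the issue description.
        String row = "<tr ><td > GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure </td></tr>";
        // The two bare '<' characters become "&lt;"; the real tags are untouched.
        System.out.println(escape(row));
    }
}
```

The escaped string could then be handed to the parser in place of the original input. Whether that is acceptable depends on the documents involved, so treat it as a workaround to evaluate rather than a fix.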

> Less than sign within tag boundaries considered as start of a new tag.
> ----------------------------------------------------------------------
>                 Key: TIKA-2928
>                 URL: https://issues.apache.org/jira/browse/TIKA-2928
>             Project: Tika
>          Issue Type: Bug
>          Components: parser, server
>    Affects Versions: 1.22
>            Reporter: Desmond David
>            Priority: Major
> So I have been attempting to parse some (somewhat non-standard) HTML documents using Tika, and I have observed that if the document contains a less-than sign (<) as part of a tag's body, Tika parses it as the start of a new tag and omits the rest of the text from the final document, up to the next newline.
> For example, consider the following HTML snippet:
> {code:html}
> <tr ><td > GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure
</td></tr><tr ><td ></td></tr><tr ><td > ENZYMES
& BILIRUBIN</td></tr>{code}
> The result is:
> {code:java}
> {code}
> Here, the rest of the content after the first `GFR` gets omitted. Based on this observation, I think the `<60` and its subsequent characters are being interpreted as part of a tag, and so are being ignored. Then at some point, `</td></tr>` is encountered, which short-circuits the execution and starts processing the next line.
> This behaviour was observed using both the Tika App and the Tika Server.
> I think the expected behaviour should be that all text within data tags (p, td, etc.) is treated as raw text, or at least that Tika can be configured to treat it that way.

This message was sent by Atlassian Jira
