tika-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1134) ContentHandler gets ignorable whitespace for <br> tags when parsing HTML
Date Thu, 08 Aug 2013 10:31:48 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733344#comment-13733344 ]

Uwe Schindler commented on TIKA-1134:

Hi Hoss,
the "rule" in TIKA is:
- TIKA inserts ignorableWhitespace events to support plain-text extraction for block elements and
<br/> tags (which are, in a sense, "empty" block elements) - see TIKA-171. Nothing else
inserts ignorableWhitespace into the content handler. This means that consumers that are only
interested in the *plain text* contents of parsed files should ignore all HTML syntax elements
and treat ignorableWhitespace as significant - this is what TextOnlyContentHandler does
to extract text. This was decided in TIKA-171 a long time ago. If you are interested in *structured*
HTML output, use the XHTML elements and ignore the whitespace.
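
As a rough illustration of the plain-text side of this rule, here is a minimal sketch of a SAX handler that keeps ignorableWhitespace instead of dropping it. It uses only the JDK's org.xml.sax classes; the class name PlainTextCollector and the demo() helper are hypothetical, and the events in demo() are written by hand to mimic what the comment describes for "foo<br>bar" rather than being produced by Tika itself.

```java
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical plain-text collector: treats ignorableWhitespace as significant. */
class PlainTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Assumption taken from the comment above: this event marks <br/> tags
        // and block element boundaries, so a plain-text consumer keeps it.
        text.append(ch, start, length);
    }

    public String getText() {
        return text.toString();
    }

    /** Hand-written events standing in for the parse of "foo<br>bar". */
    static String demo() {
        PlainTextCollector handler = new PlainTextCollector();
        handler.characters("foo".toCharArray(), 0, 3);
        handler.ignorableWhitespace("\n".toCharArray(), 0, 1);
        handler.characters("bar".toCharArray(), 0, 3);
        return handler.getText();
    }

    public static void main(String[] args) {
        // The line break survives as a newline between the two words.
        System.out.println(demo().equals("foo\nbar")); // prints "true"
    }
}
```

A consumer that wants structured output would do the opposite: act on startElement/endElement and leave ignorableWhitespace unhandled.
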
> ContentHandler gets ignorable whitespace for <br> tags when parsing HTML
> ------------------------------------------------------------------------
>                 Key: TIKA-1134
>                 URL: https://issues.apache.org/jira/browse/TIKA-1134
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Hoss Man
>         Attachments: TIKA-1134.patch
> I'm not very knowledgeable about Tika, so it's possible I'm misunderstanding something
> here, but it appears that the way Tika parses HTML to produce XHTML SAX events is misinterpreting
> "<br>" tags as equivalent to ignorable whitespace containing a newline.  This means
> that clients who ask Tika to parse files and specify their own ContentHandler to capture
> the character data can get sequences of run-on text without knowing that the "<br>" tag
> was present -- _unless_ they explicitly handle ignorableWhitespace and treat it as "real"
> whitespace -- but this creates a catch-22 if you really do want to ignore the ignorable whitespace
> in the HTML markup.
> The crux of the problem seems to be:
>  * instead of generating a startElement event for "br" the HtmlParser treats it as a
>  * xhtml.newline() generates an ignorableWhitespace SAX event instead of a characters
> SAX event
> ...either one of these by themselves might be fine, but in combination they don't really
> make any sense.  If, for example, an actual newline exists in the HTML, it comes across as part
> of a characters SAX event, not as ignorable whitespace.
> Changing the newline() function to delegate to characters(...) seems to solve the problem
> for <br> tags in HTML, but breaks several tests -- probably because the newline() function
> is also used to intentionally add (synthetic) ignorableWhitespace events after elements.
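
The run-on text described in the quoted report can be reproduced without Tika by feeding hand-written SAX events to a handler that only captures characters(). The sketch below is illustrative only: the class name CharactersOnlyCollector and the demo() helper are hypothetical, and the event sequence is an assumption modeled on the report's description of how "foo<br>bar" arrives at the ContentHandler.

```java
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that captures character data and drops ignorable whitespace. */
class CharactersOnlyCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Deliberately ignored, as a client that takes the event name
        // literally would do.
    }

    public String getText() {
        return text.toString();
    }

    /** Hand-written events standing in for the parse of "foo<br>bar". */
    static String demo() {
        CharactersOnlyCollector handler = new CharactersOnlyCollector();
        handler.characters("foo".toCharArray(), 0, 3);
        handler.ignorableWhitespace("\n".toCharArray(), 0, 1); // the <br>
        handler.characters("bar".toCharArray(), 0, 3);
        return handler.getText();
    }

    public static void main(String[] args) {
        // The break is lost and the two words run together.
        System.out.println(demo()); // prints "foobar"
    }
}
```
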

