tika-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1134) ContentHandler gets ignorable whitespace for <br> tags when parsing HTML
Date Fri, 09 Aug 2013 13:31:48 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734769#comment-13734769 ]

Uwe Schindler commented on TIKA-1134:

Hoss: I agree to fix this in the documentation.

On SOLR-4679 I explained in more detail *why TIKA is doing this*:

Let me recapitulate TIKA's problems:

- TIKA decided to use XHTML as its output format to report the parsed documents to the consumer.
This is nice, because it allows preserving some of the formatting (bold fonts, paragraphs, ...)
of the original document. Of course most of this formatting is lost, but you
can still "detect" things like emphasized text. By choosing XHTML as the output format, TIKA
must of course use XHTML formatting for new lines and similar. So whenever a line break is needed,
the TIKA parser emits a <br/> tag or places the "paragraph" (in a PDF) inside a <p>
element. As we all know, HTML ignores formatting like newlines, tabs, ... (every run is treated
as one single whitespace, much like the regex replacement {{s/\s+/ /g}}).
- On the other hand, TIKA wants to make it simple for people to extract the *plain text* contents.
With the XHTML-only approach this would be hard for the consumer, because to add the correct
newlines the consumer has to fully understand XHTML, detect block elements, and replace
them with \n.
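
To make the whitespace rule from the first point concrete, here is a minimal sketch in plain Java. The class and method names are made up for illustration and are not part of TIKA; it only shows the "collapse every run of whitespace" behaviour that HTML applies to normal text flow.

```java
class WhitespaceCollapse {

    // Collapse every run of whitespace (spaces, tabs, newlines) into one space,
    // which is roughly what an HTML renderer does with normal inline text.
    static String collapse(String text) {
        return text.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // The newline and tab carry no meaning for an HTML consumer.
        System.out.println(collapse("first line\n\t second   line"));
    }
}
```

This is why a newline placed in the character data cannot express a line break to an HTML consumer: only elements such as <br/> or <p> can.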

To support both usages of TIKA, the idea was to embed this information, which is unimportant
to HTML (as HTML ignores whitespace completely), as ignorableWhitespace, as a "convenience" for
the user. A fully compliant XHTML consumer would not parse the ignorable stuff. As it understands
HTML, it would detect a <p> element as a block element and format the output accordingly.
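
A rough sketch of the two kinds of consumer, using only the SAX classes from the JDK. Nothing here is TIKA code: the event sequence in emit() is a hand-written stand-in for what a parser might report for two paragraphs, the empty namespace URI is a simplification, and all class and method names are assumptions chosen for the example.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class IgnorableWhitespaceDemo {

    /** Collects character data and, optionally, the ignorable "convenience" whitespace. */
    static class Collector extends DefaultHandler {
        private final boolean keepIgnorable;
        private final StringBuilder text = new StringBuilder();

        Collector(boolean keepIgnorable) {
            this.keepIgnorable = keepIgnorable;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // A text-only consumer keeps this; an XHTML-aware consumer drops it
            // and relies on the element names instead.
            if (keepIgnorable) {
                text.append(ch, start, length);
            }
        }
    }

    /** Hand-written stand-in for the events of: <p>one</p> <p>two</p> */
    static void emit(ContentHandler handler) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        handler.startDocument();
        for (String paragraph : new String[] {"one", "two"}) {
            handler.startElement("", "p", "p", none);
            handler.characters(paragraph.toCharArray(), 0, paragraph.length());
            handler.endElement("", "p", "p");
            // Synthetic newline after the block element, reported as ignorable.
            handler.ignorableWhitespace(new char[] {'\n'}, 0, 1);
        }
        handler.endDocument();
    }

    static String extract(boolean keepIgnorable) {
        Collector collector = new Collector(keepIgnorable);
        try {
            emit(collector);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.text.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + extract(true) + "]");   // paragraphs separated by newlines
        System.out.println("[" + extract(false) + "]");  // run-on text: onetwo
    }
}
```

The second call shows the run-on text a consumer gets when it drops the ignorable whitespace but does not interpret <p> as a block element either.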

Solr unfortunately has a somewhat strange approach: it is mainly interested in the text-only contents,
so ideally when consuming the HTML it could use {{WriteOutContentHandler(StringBuilder, BodyContentHandler(parserContentHandler))}}.
In that case TIKA would do the right thing automatically: it would extract only text from
the body element and would use the "convenience whitespace" to format the text in an ASCII-art-like
way (using tabs, newlines, ...) :-)
Solr has a hybrid approach: it collects everything into a content tag (which is similar to the above
approach), but the bug is that, in contrast to TIKA's official WriteOutContentHandler, it does
not use the ignorable whitespace inserted for convenience. In addition TIKA also has a stack
that allows processing parts of the document (like the title element or all <em>
elements). In that case it has several StringBuilders in parallel that are populated with
the contents. The problems are here too, but cannot be solved by using ignorable whitespace:
e.g. if one indexes only the <em> elements (which are inline HTML elements, not block elements),
there is no whitespace, so all em elements would be glued together in the em field of your
index... I just mention this; in my opinion the SolrContentHandler needs more work to "correctly"
understand HTML and not just collect element names in a map!
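
The "glued together" effect for inline elements can be sketched as follows. This is not the SolrContentHandler; it is a hypothetical collector written against the JDK's SAX classes, with a hand-written event sequence, and the separator parameter is only one possible way a consumer could handle this itself.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class ElementFieldDemo {

    /** Collects only the text found inside one element name (hypothetical helper). */
    static class ElementTextCollector extends DefaultHandler {
        private final String target;
        private final String separator;
        private final StringBuilder text = new StringBuilder();
        private int depth = 0;

        ElementTextCollector(String target, String separator) {
            this.target = target;
            this.separator = separator;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (target.equals(localName)) {
                depth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (target.equals(localName) && --depth == 0) {
                // The consumer has to add its own separator: no whitespace event
                // is reported between inline elements.
                text.append(separator);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) {
                text.append(ch, start, length);
            }
        }
    }

    private static void chars(ContentHandler handler, String s) throws SAXException {
        handler.characters(s.toCharArray(), 0, s.length());
    }

    /** Stand-in events for: <p>an <em>important</em> and <em>urgent</em> note</p> */
    static void emit(ContentHandler handler) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        handler.startElement("", "p", "p", none);
        chars(handler, "an ");
        handler.startElement("", "em", "em", none);
        chars(handler, "important");
        handler.endElement("", "em", "em");
        chars(handler, " and ");
        handler.startElement("", "em", "em", none);
        chars(handler, "urgent");
        handler.endElement("", "em", "em");
        chars(handler, " note");
        handler.endElement("", "p", "p");
    }

    static String collectEm(String separator) {
        ElementTextCollector collector = new ElementTextCollector("em", separator);
        try {
            emit(collector);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.text.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(collectEm(""));   // importanturgent
        System.out.println(collectEm(" "));  // important urgent
    }
}
```

With an empty separator the two emphasized words end up glued together, which is the situation described above; a consumer that knows about the element boundaries can insert its own separator.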

Now to your complaint: you proposed to report the newlines as real {{characters()}} events,
but this is not the right thing to do here. As I said, HTML does not know these characters;
they are ignored. The "formatting" is done by the element names (like <p>, <div>,
<table>). So the "helper" whitespace for text-only consumers should be inserted as ignorableWhitespace
only. If we added it to the real character data, we would report things that an HTML
parser (like nekohtml) would never report to the consumer. Nekohtml would also report this
useless extra whitespace as ignorable.

The convenience here is that TIKA's XHTMLContentHandler, used by all parsers, is "configured"
to help the text-only user without hurting the HTML-only user. This differentiation is done
by reporting the HTML element names (p, div, table, th, td, tr, abbr, em, strong, ...) and
also reporting the ASCII-art text-only content like TABs inside tables, newlines after block
elements, ... This is always done as ignorableWhitespace (for convenience); a real HTML parser
must ignore it, and it is correct to do this.

I think we should document this in the javadocs or the "howto" page, so implementors of ContentHandlers
know what to do!
> ContentHandler gets ignorable whitespace for <br> tags when parsing HTML
> ------------------------------------------------------------------------
>                 Key: TIKA-1134
>                 URL: https://issues.apache.org/jira/browse/TIKA-1134
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Hoss Man
>         Attachments: TIKA-1134.patch
> I'm not very knowledgeable about Tika, so it's possible I'm misunderstanding something
here, but it appears that the way Tika parses HTML to produce XHTML SAX events is misinterpreting
"<br>" tags as equivalent to ignorable whitespace containing a newline.  This means
that clients who ask Tika to parse files, and specify their own ContentHandler to capture
the character data can get sequences of run-on text w/o knowing that the "<br>" tag
was present -- _unless_ they explicitly handle ignorableWhitespace and treat it as "real"
whitespace -- but this creates a catch-22 if you really do want to ignore the ignorable whitespace
in the HTML markup.
> The crux of the problem seems to be:
>  * instead of generating a startElement event for "br" the HtmlParser treats it as a
>  * xhtml.newline() generates an ignorableWhitespace SAX event instead of a characters
SAX event
> ...either one of these by themselves might be fine, but in combination they don't really
make any sense.  If for example an actual newline exists in the html, it comes across as part
of a characters SAX event, not as ignorable whitespace.
> Changing the newline() function to delegate to characters(...) seems to solve the problem
for <br> tags in HTML, but breaks several tests -- probably because the newline() function
is also used to intentionally add (synthetic) ignorableWhitespace events after elements.

