tika-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1134) ContentHandler gets ignorable whitespace for <br> tags when parsing HTML
Date Thu, 08 Aug 2013 18:41:48 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733819#comment-13733819 ]

Hoss Man commented on TIKA-1134:
--------------------------------

The crux of my initial confusion, and of my continued concern that this needs to be documented
so future users avoid the same confusion, comes from these types of statements Uwe made...

bq. ... consumers that are only interested in the plain text contents of parsed files, should
ignore all HTML syntax elements and just treat ignorableWhitespace as significant

...and over in SOLR-4679, Uwe added...

bq. ... "ignoreable whitespace" is XML semantics only, in (X)HTML this does not exist (it
is handled differently, but is never reported by HTML parsers), so the idea in TIKA is to
"reuse" (its a bit "incorrect") the ignoreableWhitespace SAX event to report this "added whitespace".


As someone who is not a Tika expert, or an XHTML expert, or even an HTML expert -- I have
no way of knowing any of this information if I'm trying to build/maintain a custom ContentHandler
to parse out specific bits of information from arbitrary files.

In this specific case, I'm maintaining a ContentHandler used in Tika that attempts to be very
generic and agnostic to the types of files that get parsed. So even if I were an HTML expert
and understood that "ignorable whitespace" isn't really an (X)HTML concept, I wouldn't know
if/when I should assume that was relevant in building a custom ContentHandler for Tika. All
I have to go on is the general information that Tika handles the arbitrary file parsing for
me and generates SAX events, combined with the org.xml.sax.ContentHandler javadocs. Those
might lead me to the XML spec's explanation of what ignorableWhitespace is, which in turn
leads me to (seemingly reasonably) assume that if Tika is taking care of parsing file type
$foo and mapping that to SAX events, I probably don't want ignorableWhitespace. But the
truth is I do for some (all?) file types.

That part isn't clear in any docs I've seen, and should probably be made clear somewhere.
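
To make the point concrete, here is a minimal sketch of the kind of handler Uwe's statements imply, assuming only the standard org.xml.sax API. The class name PlainTextHandler and its getText() method are hypothetical and are not part of Tika or Solr. The events fired in main() are simulated by hand to mimic the behaviour described in this issue; they do not come from a real Tika parse.

```java
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical plain-text collector; not a Tika or Solr class. */
class PlainTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Deliberately treated as significant: per the statements quoted
        // above, Tika reuses this event for whitespace it adds itself.
        text.append(ch, start, length);
    }

    public String getText() {
        return text.toString();
    }

    public static void main(String[] args) {
        // Hand-simulated event sequence (an assumption for illustration).
        PlainTextHandler handler = new PlainTextHandler();
        handler.characters("foo".toCharArray(), 0, 3);
        handler.ignorableWhitespace("\n".toCharArray(), 0, 1);
        handler.characters("bar".toCharArray(), 0, 3);
        System.out.println(handler.getText());
    }
}
```

A handler that extends DefaultHandler without overriding ignorableWhitespace would silently drop that newline, which is exactly the assumption the javadocs seem to encourage.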
                
> ContentHandler gets ignorable whitespace for <br> tags when parsing HTML
> ------------------------------------------------------------------------
>
>                 Key: TIKA-1134
>                 URL: https://issues.apache.org/jira/browse/TIKA-1134
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Hoss Man
>         Attachments: TIKA-1134.patch
>
>
> I'm not very knowledgeable about Tika, so it's possible I'm misunderstanding something
> here, but it appears that the way Tika parses HTML to produce XHTML SAX events is misinterpreting
> "<br>" tags as equivalent to ignorable whitespace containing a newline.  This means
> that clients who ask Tika to parse files, and specify their own ContentHandler to capture
> the character data, can get sequences of run-on text without knowing that the "<br>" tag
> was present -- _unless_ they explicitly handle ignorableWhitespace and treat it as "real"
> whitespace -- but this creates a catch-22 if you really do want to ignore the ignorable whitespace
> in the HTML markup.
> The crux of the problem seems to be:
>  * instead of generating a startElement event for "br", the HtmlParser treats it as an
> xhtml.newline() call.
>  * xhtml.newline() generates an ignorableWhitespace SAX event instead of a characters
> SAX event
> ...either one of these by themselves might be fine, but in combination they don't really
> make any sense.  If, for example, an actual newline exists in the HTML, it comes across as part
> of a characters SAX event, not as ignorable whitespace.
> Changing the newline() function to delegate to characters(...) seems to solve the problem
> for <br> tags in HTML, but breaks several tests -- probably because the newline() function
> is also used to intentionally add (synthetic) ignorableWhitespace events after elements.
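
One way to see which SAX event carries each newline is to log the events a handler receives. The following is a rough sketch using only the standard org.xml.sax API; the class name SaxEventLog and its getEvents() method are made up for illustration. The sequence fired in main() is a hand-written simulation of what the description above reports for input like "line one<br>line two", not output captured from Tika.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical event logger; records which SAX callback delivered what. */
class SaxEventLog extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("startElement(" + localName + ")");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        events.add("characters(" + escape(ch, start, length) + ")");
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        events.add("ignorableWhitespace(" + escape(ch, start, length) + ")");
    }

    // Make newlines visible in the log instead of breaking the line.
    private static String escape(char[] ch, int start, int length) {
        return new String(ch, start, length).replace("\n", "\\n");
    }

    public List<String> getEvents() {
        return events;
    }

    public static void main(String[] args) {
        // Simulated sequence (an assumption): no startElement for "br",
        // only an ignorableWhitespace event between the two text runs.
        SaxEventLog log = new SaxEventLog();
        log.characters("line one".toCharArray(), 0, 8);
        log.ignorableWhitespace("\n".toCharArray(), 0, 1);
        log.characters("line two".toCharArray(), 0, 8);
        log.getEvents().forEach(System.out::println);
    }
}
```

Passing such a logger to a real parse in place of the hand-fired events would show whether a given newline arrives through characters or through ignorableWhitespace.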

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
