tika-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Issue in text extraction in Solr / Tika
Date Sat, 20 Aug 2011 15:32:49 GMT
On Sat, Aug 20, 2011 at 10:19 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
>> Hmm, actually: the <p> element allows text, in addition to child
>> elements? So shouldn't any whitespace within the <p>...</p> be treated
>> as significant (part of the content)?
>
> This is indeed very complicated. For mixed-content elements, the
> whitespace inside is preserved, but not next to child elements - very stupid
> rules. If you have ever coded HTML, you know this :-)

Hmm... are you sure? :)

Because I've tried Firefox, Chrome and Safari on the XML file, and all
of them insert a space when rendering.

Also, I tried Tika itself (feeding back the .xml it had created, to
produce text) and it also inserts a space.

I also tried JTidy, and it inserts the space too, though it thinks it's
parsing HTML, so that may be an invalid test.

Anyway... even if the strict XML whitespace rules state that this
newline should not be counted as whitespace in the content, so many
tools seem not to follow them that I think it's worth trying to fix
Tika to not add this newline.
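
For illustration, here is a minimal sketch of the kind of check described
above. It is not the actual test that was run: the class name, the helper
name and the sample markup are all made up for this example. It uses the
JDK's default non-validating DOM parser to show that a newline inside a
mixed-content <p> is reported as part of the text, which a renderer would
then typically collapse to a space.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class WhitespaceDemo {
    // Hypothetical helper: parse an XML string and return the text
    // content of its root element, whitespace included.
    static String textOf(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample: a newline between the text and the child element.
        String xml = "<p>Hello\n<b>world</b></p>";
        // Make the newline visible when printing.
        System.out.println(textOf(xml).replace("\n", "\\n"));
    }
}
```

With no DTD in play, the parser has no way to treat that newline as
ignorable, so it comes through in the text between "Hello" and "world".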

Mike McCandless

http://blog.mikemccandless.com
