tika-dev mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Issue in text extraction in Solr / Tika
Date Sat, 20 Aug 2011 16:11:00 GMT
Does it really add this newline? That would be strange: if you look at
XHTMLContentHandler, it does not emit one. So the newline must come from
somewhere else.
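
One way to narrow this down is to inspect the raw character data in the
generated XHTML with a plain XML parser, independent of Tika. The following
is a minimal Python sketch, not Tika code; the sample fragment and the helper
name text_chunks are made up for illustration. It shows that a standard XML
parser reports a newline between two child elements of <p> as character data,
so if the newline is present in the serialized output, every downstream
consumer will see it.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment (not actual Tika output): a newline sits
# between two inline children of <p>.
SAMPLE = "<p><b>foo</b>\n<i>bar</i></p>"


def text_chunks(xml_source):
    """Return every character-data chunk in document order, unmodified."""
    root = ET.fromstring(xml_source)
    return list(root.itertext())


if __name__ == "__main__":
    # repr() makes whitespace-only chunks visible.
    for chunk in text_chunks(SAMPLE):
        print(repr(chunk))
```

If a whitespace-only chunk shows up here for the real file, the newline is
already in the XHTML and the question becomes which component wrote it.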

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Saturday, August 20, 2011 5:33 PM
> To: dev@tika.apache.org
> Subject: Re: Issue in text extraction in Solr / Tika
> 
> On Sat, Aug 20, 2011 at 10:19 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
> >> Hmm, actually: the <p> element allows text, in addition to child
> >> elements? So shouldn't any whitespace within the <p>...</p> be treated
> >> as significant (part of the content)?
> >
> > This is indeed very complicated. For mixed content elements, the
> > whitespace inside is preserved, but not next to child elements - very
> > stupid rules. If you have ever coded HTML, you know this :-)
> 
> Hmm... are you sure? :)
> 
> Because I've tried Firefox, Chrome, and Safari on the XML file, and all
> insert a space in rendering.
> 
> Also, I tried Tika itself (feeding back the .xml it had created, to produce
> text) and it also inserts a space.
> 
> I also tried JTidy and it inserts the space, though it thinks it's parsing
> HTML, so that may be an invalid test.
> 
> Anyway... even if the strict XML whitespace rules state that this newline
> should not be counted as whitespace in the content, so many tools seem not
> to handle it correctly that I think it's worth trying to fix Tika to not
> add this newline.
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com

