tika-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Issue in text extraction in Solr / Tika
Date Sat, 20 Aug 2011 16:25:25 GMT
I found the source of the newline, and opened this issue:

    https://issues.apache.org/jira/browse/TIKA-692

Let's continue talking over there...

Mike McCandless

http://blog.mikemccandless.com

On Sat, Aug 20, 2011 at 12:11 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> Does it really add this newline, because this is strange? If you look at
> XHTMLContentHandler it does not. So the newline must come from somewhere
> else.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
>> -----Original Message-----
>> From: Michael McCandless [mailto:lucene@mikemccandless.com]
>> Sent: Saturday, August 20, 2011 5:33 PM
>> To: dev@tika.apache.org
>> Subject: Re: Issue in text extraction in Solr / Tika
>>
>> On Sat, Aug 20, 2011 at 10:19 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
>> >> Hmm, actually: the <p> element allows text, in addition to child
>> >> elements? So shouldn't any whitespace within the <p>...</p> be
>> >> treated as significant (part of the content)?
>> >
>> > This is indeed very complicated. For mixed content elements, the
>> > whitespace inside is preserved, but not next to child elements - very
>> > stupid rules. If you have ever coded HTML, you know this :-)
>>
>> Hmm... are you sure? :)
>>
>> Because I've tried Firefox and Chrome and Safari on the xml file, and
>> all insert a space in rendering.
>>
>> Also, I tried Tika itself (feeding back the .xml it had created, to
>> produce text) and it also inserts a space.
>>
>> I also tried JTidy and it inserts the space, though it thinks it's
>> parsing HTML, so that may be an invalid test.
>>
>> Anyway... even if the strict XML whitespace rules state that this
>> newline should not be counted as whitespace in the content, so many
>> tools seem not to follow them that I think it's worth trying to fix
>> Tika to not add this newline.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>
>

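For illustration only, here is a minimal sketch of the behavior discussed in the quoted thread: a plain SAX parser (the JDK's, not Tika's own code) reports a newline inside mixed content as ordinary character data, even when it sits next to a child element. The class name `MixedContentWhitespace`, the helper `extractText`, and the sample markup are all hypothetical; the actual file from the thread is not shown here.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class MixedContentWhitespace {

    // Hypothetical helper: concatenates every characters() event the
    // parser delivers, without adding or trimming anything.
    static String extractText(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample: a newline between the text and a child element.
        String xml = "<p>foo\n<b>bar</b></p>";
        // Show the newline visibly as \n when printing.
        System.out.println(extractText(xml).replace("\n", "\\n"));
        // prints: foo\nbar
    }
}
```

With no DTD declaring element-only content, the parser has no basis for treating the newline as ignorable, so it arrives through `characters()` like any other text. Whether a downstream tool should then render it as a space is the separate question debated above.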