tika-dev mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Issue in text extraction in Solr / Tika
Date Sat, 20 Aug 2011 14:16:49 GMT
Yes,

the text-only output handler follows exactly those whitespace processing
guidelines and also inserts newlines at the correct places for block
elements like <p/>. The code was partially written by me, especially the
block element parts :-)

So if the text-only output is formatted correctly, then the HTML should be
fine too. Of course, this useless splitting of formatting is mostly caused
by the original Word document. It usually comes from the Word editor, e.g. when
you click on "bold", then think "oh, I missed a character", and then make the
rest bold as well. Depending on the order of actions, these sections of bold
text are not merged together. Tika is not doing anything wrong; it just
translates the formatting of the Word/PDF document to XHTML.
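
To illustrate the idea, here is a minimal sketch in Python of a text-only
SAX handler. It is not Tika's actual code. The class name TextOnlyHandler,
the helper extract_text, and the BLOCK_ELEMENTS set are made up for this
example, and it assumes that whitespace-only runs between elements are just
layout and can be dropped (a simplification: it would also drop a single
significant space between two inline elements).

```python
import xml.sax

# Hypothetical set of block-level elements; a newline is emitted after each.
BLOCK_ELEMENTS = {"p", "div", "h1", "h2", "h3", "li", "tr"}


class TextOnlyHandler(xml.sax.ContentHandler):
    """Collects text, skipping whitespace-only runs between elements."""

    def __init__(self):
        super().__init__()
        self._parts = []   # text pieces kept so far
        self._buffer = []  # character data seen since the last tag

    def _flush(self):
        # The parser may deliver one text run in several chunks, so the
        # decision is made on the whole run, not on each chunk.
        run = "".join(self._buffer)
        self._buffer = []
        if run.strip():
            self._parts.append(run)

    def characters(self, content):
        self._buffer.append(content)

    def startElement(self, name, attrs):
        self._flush()

    def endElement(self, name):
        self._flush()
        if name in BLOCK_ELEMENTS:
            self._parts.append("\n")

    def text(self):
        return "".join(self._parts)


def extract_text(xhtml):
    """Parse an XHTML fragment and return its text-only rendering."""
    handler = TextOnlyHandler()
    xml.sax.parseString(xhtml.encode("utf-8"), handler)
    return handler.text()


if __name__ == "__main__":
    sample = "<p>\n<b>NAMITGOP</b>\n<b> SAHA</b>\n<b>D</b>\n</p>"
    print(repr(extract_text(sample)))  # 'NAMITGOP SAHAD\n'
```

With the fragment from this thread, the newlines between the <b> elements
are dropped, the leading space inside "<b> SAHA</b>" is kept, and "SAHA" and
"D" end up adjacent again.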

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Michael McCandless [mailto:lucene@mikemccandless.com]
> Sent: Saturday, August 20, 2011 3:25 PM
> To: dev@tika.apache.org
> Subject: Re: Issue in text extraction in Solr / Tika
> 
> Ahhh.... what threw me off was the browser rendering, which turns that
> newline into space so I see "SAHA D".
> 
> Hmm, actually: the <p> element allows text, in addition to child elements?
> So shouldn't any whitespace within the <p>...</p> be treated as significant
> (part of the content)?
> 
> I need to go learn XML's whitespace rules :)
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Sat, Aug 20, 2011 at 8:39 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
> > From the XML point of view, it's not separated. It's just in two elements,
> > but with no whitespace in between, according to parsing standards (see the
> > XML whitespace rules).
> >
> > Uwe
> > --
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, 28213 Bremen
> > http://www.thetaphi.de
> >
> >
> >
> > Michael McCandless <lucene@mikemccandless.com> schrieb:
> >
> > One thing I still don't like is with the XML (-x) or XHTML (-h)
> > output, the result filtered output incorrectly splits up a word. The
> > doc has:
> >
> > NAMITGOP SAHAD
> >
> > But in the XML/XHTML it looks like this:
> >
> > <p>
> > <b>NAMITGOP</b>
> > <b> SAHA</b>
> > <b>D</b>
> > </p>
> >
> > Ie SAHAD became SAHA and D, separated.
> >
> > I think this is a bug and I think I know why it's happening... I'll
> > open an issue.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Sat, Aug 20, 2011 at 6:40 AM, Michael McCandless
> > <lucene@mikemccandless.com> wrote:
> >> OK one correction: I ran the TikaCLI tool with the -T option, which
> >> extracts "main content only"; when I re-ran with the -t (lowercase)
> >> option, which outputs all plain text, then it looks like all text
> >> appears correctly (phew!).
> >>
> >> On moving to 0.9, that's your call -- I'm not sure what's changed
> >> since then, but presumably it is better than 0.8!
> >>
> >> Displaying the equivalent of "-t" from the TikaCLI tool seems like a
> >> good approach?  Especially because the XHTML output incorrectly
> >> breaks up the SAHAD from your document.
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >> On Sat, Aug 20, 2011 at 1:07 AM, nirnaydewan <nirnaydewan@gmail.com> wrote:
> >>> First of all thanks again Mike for helping me out.
> >>>
> >>> Yes, I have seen that some text does get stripped out sometimes. Any
> >>> idea as to why this could be happening?
> >>>
> >>> I am using the bundled Solr 3.3.0, which comes with Tika 0.8. Should
> >>> I move to 0.9? If so, how?
> >>>
> >>> Also, I am storing only this text, which I am trying to display. If
> >>> the XHTML produces the correct text, how do I store it instead?
> >>>
> >>>
> >>> Thanks
> >>>
> >>>
> >>> --
> >>> View this message in context:
> >>> http://lucene.472066.n3.nabble.com/Issue-in-text-extraction-in-Solr-Tika-tp3267810p3269982.html
> >>> Sent from the Apache Tika - Development mailing list archive at Nabble.com.
> >>>
> >>
> >
> >

