tika-dev mailing list archives

From Uwe Schindler <...@thetaphi.de>
Subject Re: Issue in text extraction in Solr / Tika
Date Sat, 20 Aug 2011 12:39:22 GMT
From the XML point of view, it's not separated. The text is just in two elements, with no
whitespace in between, so according to parsing standards (see the XML whitespace rules) it is still one word.
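
To illustrate the point, here is a minimal sketch using Python's standard xml.etree.ElementTree rather than Tika itself. The markup is hypothetical (the exact XHTML Tika produced is not shown in this thread); it only assumes that the word is split across two adjacent inline elements with no whitespace between them.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, not Tika's actual output: one word split across
# two adjacent inline elements with no whitespace between them.
adjacent = ET.fromstring("<p><b>SAHA</b><i>D</i> next</p>")

# Concatenating the text nodes in document order keeps the word intact,
# because the parser does not invent whitespace between elements.
adjacent_text = "".join(adjacent.itertext())
print(adjacent_text)

# For contrast: a literal space between the elements is a real text node,
# so here the two pieces really are separate words.
spaced = ET.fromstring("<p><b>SAHA</b> <i>D</i></p>")
spaced_text = "".join(spaced.itertext())
print(spaced_text)
```

A consumer that joins text nodes with an extra separator (for example " ".join(...)) would split the word in the first case, which is one way such a break could appear downstream.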

Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen

Michael McCandless <lucene@mikemccandless.com> wrote:

One thing I still don't like is that with the XML (-x) or XHTML (-h)
output, the resulting filtered output incorrectly splits up a word. The
doc has:


But in the XML/XHTML it looks like this:

<b> SAHA</b>

I.e., SAHAD became SAHA and D, separated.

I think this is a bug and I think I know why it's happening... I'll
open an issue.

Mike McCandless


On Sat, Aug 20, 2011 at 6:40 AM, Michael McCandless
<lucene@mikemccandless.com> wrote:
> OK one correction: I ran the TikaCLI tool with the -T option, which
> extracts "main content only"; when I re-ran with the -t (lowercase)
> option, which outputs all plain text, then it looks like all text
> appears correctly (phew!).
>
> On moving to 0.9, that's your call -- I'm not sure what's changed
> since then, but presumably it is better than 0.8!
>
> Displaying the equivalent of "-t" from the TikaCLI tool seems like a
> good approach?  Especially because the XHTML output incorrectly breaks
> up the SAHAD from your document.
>
> Mike McCandless
> http://blog.mikemccandless.com
>
> On Sat, Aug 20, 2011 at 1:07 AM, nirnaydewan <nirnaydewan@gmail.com> wrote:
>> First of all, thanks again Mike for helping me out.
>>
>> Yes, I have seen that some text does get stripped out sometimes. Any idea as
>> to why this could be happening?
>>
>> I am using the bundled Solr 3.3.0, which comes with Tika 0.8. Should I move
>> to 0.9? If so, how?
>>
>> Also, I am storing only this text, which I am trying to display. If the XHTML
>> produces the correct text, how do I store it instead?
>>
>> Thanks
>> --
>> View this message in context: http://lucene.472066.n3.nabble.com/Issue-in-text-extraction-in-Solr-Tika-tp3267810p3269982.html
>> Sent from the Apache Tika - Development mailing list archive at Nabble.com.
