tika-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Issue in text extraction in Solr / Tika
Date Sat, 20 Aug 2011 10:40:18 GMT
OK one correction: I ran the TikaCLI tool with the -T option, which
extracts "main content only"; when I re-ran with the -t (lowercase)
option, which outputs all plain text, then it looks like all text
appears correctly (phew!).

On moving to 0.9, that's your call -- I'm not sure what's changed
since then, but presumably it is better than 0.8!

Displaying the equivalent of "-t" from the TikaCLI tool seems like a
good approach?  Especially because the XHTML output incorrectly breaks
up the SAHAD from your document.
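
For illustration only, here is a rough sketch of what "plain text from the
body" amounts to: walk the XHTML as SAX events and keep only the character
data found inside <body>. It uses just the JDK's SAX parser, not Tika; the
class name BodyTextExtractor and the sample input are made up for this
example. In Tika itself the analogous handler is BodyContentHandler, which
is not shown here.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): keeps only the text inside <body>.
class BodyTextExtractor {

    /** Returns the character data found inside <body>, with all markup dropped. */
    static String extractBodyText(String xhtml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private int bodyDepth = 0; // greater than 0 while inside <body>

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Start counting at <body>, then track every nested element.
                if (bodyDepth > 0 || "body".equals(qName)) {
                    bodyDepth++;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (bodyDepth > 0) {
                    bodyDepth--;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text outside <body> (for example the <title>) is skipped.
                if (bodyDepth > 0) {
                    text.append(ch, start, length);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input: one word is split across an inline element.
        String xhtml = "<html><head><title>ignored</title></head>"
                + "<body><p>Hello <b>wor</b>ld</p></body></html>";
        System.out.println(extractBodyText(xhtml));
    }
}
```

Because inline markup is simply dropped, a word that is split across
elements in the XHTML comes out joined in the text. Whether that matches
what happens to your document is a guess on my part.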

Mike McCandless

http://blog.mikemccandless.com

On Sat, Aug 20, 2011 at 1:07 AM, nirnaydewan <nirnaydewan@gmail.com> wrote:
> First of all, thanks again, Mike, for helping me out.
>
> Yes, I have seen that some text does get stripped out sometimes. Any idea
> why this could be happening?
>
> I am using the bundled Solr 3.3.0, which comes with Tika 0.8. Should I move
> to 0.9? If so, how?
>
> Also, I am storing only this text, which is what I am trying to display. If
> the XHTML produces the correct text, how do I store it instead?
>
>
> Thanks
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Issue-in-text-extraction-in-Solr-Tika-tp3267810p3269982.html
> Sent from the Apache Tika - Development mailing list archive at Nabble.com.
>
