tika-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Issue in text extraction in Solr / Tika
Date Sat, 20 Aug 2011 12:35:30 GMT
One thing I still don't like: with the XML (-x) or XHTML (-h)
output, the resulting filtered output incorrectly splits up a word.  The
doc has:

    NAMITGOP SAHAD

But in the XML/XHTML it looks like this:

  <p>
  <b>NAMITGOP</b>
  <b> SAHA</b>
  <b>D</b>
  </p>

I.e., SAHAD was split into SAHA and D, in separate <b> elements.

I think this is a bug and I think I know why it's happening... I'll
open an issue.
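
To show why the split matters to anything that consumes the XHTML, here
is a rough sketch in Python (standard library only). It is not how Tika
works internally; the function names are made up for illustration, and
the second function assumes the whitespace between the <b> elements is
only pretty-printing, which may not hold for other documents.

```python
import xml.etree.ElementTree as ET

# The fragment from the XHTML output quoted above.
FRAGMENT = """<p>
<b>NAMITGOP</b>
<b> SAHA</b>
<b>D</b>
</p>"""


def join_with_spaces(xml_text):
    """Naive consumer: treat every text node as its own token."""
    root = ET.fromstring(xml_text)
    return " ".join(t.strip() for t in root.itertext() if t.strip())


def join_bold_runs(xml_text):
    """Hypothetical workaround: concatenate the <b> runs verbatim.

    Assumes the newlines between elements are layout only, so the
    leading space inside "<b> SAHA</b>" is the only real word break.
    """
    root = ET.fromstring(xml_text)
    return "".join(b.text or "" for b in root.iter("b"))


print(join_with_spaces(FRAGMENT))  # the word is split in two
print(join_bold_runs(FRAGMENT))    # the word is reassembled
```

The naive join yields "NAMITGOP SAHA D", so a search for SAHAD would
miss the document; joining the runs verbatim recovers "NAMITGOP SAHAD".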

Mike McCandless

http://blog.mikemccandless.com

On Sat, Aug 20, 2011 at 6:40 AM, Michael McCandless
<lucene@mikemccandless.com> wrote:
> OK one correction: I ran the TikaCLI tool with the -T option, which
> extracts "main content only"; when I re-ran with the -t (lowercase)
> option, which outputs all plain text, then it looks like all text
> appears correctly (phew!).
>
> On moving to 0.9, that's your call -- I'm not sure what's changed
> since then, but presumably it is better than 0.8!
>
> Displaying the equivalent of "-t" from the TikaCLI tool seems like a
> good approach?  Especially because the XHTML output incorrectly breaks
> up the SAHAD from your document.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Sat, Aug 20, 2011 at 1:07 AM, nirnaydewan <nirnaydewan@gmail.com> wrote:
>> First of all thanks again Mike for helping me out.
>>
>> Yes, I have seen that some text does get stripped out sometimes. Any idea as
>> to why this could be happening?
>>
>> I am using the bundled Solr 3.3.0, which comes with Tika 0.8. Should I move
>> to 0.9? If so, how?
>>
>> Also, I am storing only this text, which I am trying to display. If the XHTML
>> produces the correct text, how do I store it instead?
>>
>>
>> Thanks
>>
>>
>
