tika-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Issue in text extraction in Solr / Tika
Date Sat, 20 Aug 2011 13:25:10 GMT
Ahhh.... what threw me off was the browser rendering, which turns that
newline into space so I see "SAHA D".

Hmm, actually: the <p> element allows text, in addition to child
elements?  So shouldn't any whitespace within the <p>...</p> be
treated as significant (part of the content)?

I need to go learn XML's whitespace rules :)
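
A quick check of the question above, as a minimal sketch: it uses Python's
standard xml.etree.ElementTree rather than Tika, and the helper name text_of
is made up for illustration. With no DTD in play, the parser hands every
character between the tags to the application, so the newlines inside
<p>...</p> come back as part of the text.

```python
import xml.etree.ElementTree as ET

# The pretty-printed output quoted in this thread: each <b> on its own line.
pretty = "<p>\n<b>NAMITGOP</b>\n<b> SAHA</b>\n<b>D</b>\n</p>"
# The same elements with no whitespace between them (assumed "ideal" output).
compact = "<p><b>NAMITGOP</b><b> SAHA</b><b>D</b></p>"


def text_of(xml_source):
    """Concatenate all character data under the root, in document order."""
    return "".join(ET.fromstring(xml_source).itertext())


pretty_text = text_of(pretty)
compact_text = text_of(compact)

print(repr(pretty_text))   # '\nNAMITGOP\n SAHA\nD\n'
print(repr(compact_text))  # 'NAMITGOP SAHAD'
```

In the pretty-printed form a newline sits between SAHA and D, which is what
a browser collapses to a space; only the compact form keeps SAHAD together.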

Mike McCandless

http://blog.mikemccandless.com

On Sat, Aug 20, 2011 at 8:39 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
> From the XML point of view, it's not separated. It's just in two elements, but with no whitespace
> in between, according to parsing standards (see XML whitespace rules).
>
> Uwe
> --
> Uwe Schindler
> H.-H.-Meier-Allee 63, 28213 Bremen
> http://www.thetaphi.de
>
>
>
> Michael McCandless <lucene@mikemccandless.com> schrieb:
>
> One thing I still don't like is that, with the XML (-x) or XHTML (-h)
> output, the resulting filtered output incorrectly splits up a word. The
> doc has:
>
> NAMITGOP SAHAD
>
> But in the XML/XHTML it looks like this:
>
> <p>
> <b>NAMITGOP</b>
> <b> SAHA</b>
> <b>D</b>
> </p>
>
> I.e., SAHAD became SAHA and D, separated.
>
> I think this is a bug and I think I know why it's happening... I'll
> open an issue.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Sat, Aug 20, 2011 at 6:40 AM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
>> OK one correction: I ran the TikaCLI tool with the -T option, which
>> extracts "main content only"; when I re-ran with the -t (lowercase)
>> option, which outputs all plain text, then it looks like all text
>> appears correctly (phew!).
>>
>> On moving to 0.9, that's your call -- I'm not sure what's changed
>> since then, but presumably it is better than 0.8!
>>
>> Displaying the equivalent of "-t" from the TikaCLI tool seems like a
>> good approach?  Especially because the XHTML output incorrectly breaks
>> up the SAHAD from your document.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Sat, Aug 20, 2011 at 1:07 AM, nirnaydewan <nirnaydewan@gmail.com> wrote:
>>> First of all, thanks again, Mike, for helping me out.
>>>
>>> Yes, I have seen that some text does get stripped out sometimes. Any idea as
>>> to why this could be happening?
>>>
>>> I am using the bundled Solr 3.3.0, which comes with Tika 0.8. Should I move
>>> to 0.9? If so, how?
>>>
>>> Also, I am storing only this text, which I am trying to display. If the XHTML
>>> produces the correct text, how do I store it instead?
>>>
>>>
>>> Thanks
>>>
>>>
>>> --
>>> View this message in context: http://lucene.472066.n3.nabble.com/Issue-in-text-extraction-in-Solr-Tika-tp3267810p3269982.html
>>> Sent from the Apache Tika - Development mailing list archive at Nabble.com.
>>>
>>
>
>
