From the XML point of view, it's not separated. It's just in two elements, but with no whitespace
in between, according to parsing standards (see the XML whitespace rules).
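
A minimal sketch of what I mean, using Python's standard xml.etree.ElementTree rather than Tika itself. It assumes the line breaks in the quoted output below are only display formatting, so the <b> elements sit directly next to each other in the real stream:

```python
import xml.etree.ElementTree as ET

# Assumption: the elements are adjacent, with no whitespace between the tags.
xhtml = "<p><b>NAMITGOP</b><b> SAHA</b><b>D</b></p>"

root = ET.fromstring(xhtml)

# itertext() yields the character data of all descendants in document order;
# joining the pieces with an empty string adds nothing between the elements.
text = "".join(root.itertext())
print(text)  # NAMITGOP SAHAD
```

A consumer that concatenates the character data this way gets the word back whole. One that inserts a space or newline at every element boundary would show SAHA and D apart.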
Uwe
--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de
Michael McCandless <lucene@mikemccandless.com> wrote:
One thing I still don't like is that with the XML (-x) or XHTML (-h)
output, the resulting filtered output incorrectly splits up a word. The
doc has:
NAMITGOP SAHAD
But in the XML/XHTML it looks like this:
<p>
<b>NAMITGOP</b>
<b> SAHA</b>
<b>D</b>
</p>
I.e., SAHAD became SAHA and D, separated.
I think this is a bug and I think I know why it's happening... I'll
open an issue.
Mike McCandless
http://blog.mikemccandless.com
On Sat, Aug 20, 2011 at 6:40 AM, Michael McCandless
<lucene@mikemccandless.com> wrote:
> OK one correction: I ran the TikaCLI tool with the -T option, which
> extracts "main content only"; when I re-ran with the -t (lowercase)
> option, which outputs all plain text, then it looks like all text
> appears correctly (phew!).
>
> On moving to 0.9, that's your call -- I'm not sure what's changed
> since then, but presumably it is better than 0.8!
>
> Displaying the equivalent of "-t" from the TikaCLI tool seems like a
> good approach? Especially because the XHTML output incorrectly breaks
> up the SAHAD from your document.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Sat, Aug 20, 2011 at 1:07 AM, nirnaydewan <nirnaydewan@gmail.com> wrote:
>> First of all thanks again Mike for helping me out.
>>
>> Yes, I have seen that; some text does get stripped out sometimes. Any idea as
>> to why this could be happening?
>>
>> I am using the bundled Solr 3.3.0, which comes with Tika 0.8. Should I move
>> to 0.9? If so, how?
>>
>> Also, I am storing only this text, which I am trying to display. If the XHTML
>> produces the correct text, how do I store it instead?
>>
>>
>> Thanks
>>
>>
>> --
>> View this message in context: http://lucene.472066.n3.nabble.com/Issue-in-text-extraction-in-Solr-Tika-tp3267810p3269982.html
>> Sent from the Apache Tika - Development mailing list archive at Nabble.com.
>>
>