manifoldcf-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Tika Extractor - extract document as (X)HTML not as textonly
Date Fri, 03 Jan 2020 09:57:57 GMT
Thanks a lot, Karl. I will look into this.


> On 03.01.2020 at 10:04, Karl Wright <daddywri@gmail.com> wrote:
> 
> 
> Plain text is used because otherwise the standard text processing inside Lucene
> would index tags as terms, which is usually not what you want.
> 
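To illustrate the point about tags being indexed as terms, here is a minimal sketch. The `tokenize` method is a hypothetical stand-in for a simple analyzer (lowercase, split on non-alphanumeric characters); it is an assumption for illustration, not Lucene's actual analysis chain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TagTermsDemo {
    // Hypothetical stand-in for a simple text analyzer: lowercases the input
    // and splits on anything that is not a letter or digit.
    // This is NOT Lucene code, only an illustration of the effect.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Markup fed to a text-only pipeline: tag names end up as terms.
        System.out.println(tokenize("<h1>Chapter 1</h1><p>Scope</p>"));
        // Plain text: only the actual words become terms.
        System.out.println(tokenize("Chapter 1 Scope"));
    }
}
```

With markup as input, `h1` and `p` show up in the term list next to the real words, so a search for "p" or "h1" would match almost every document.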
> If you want the Tika Extractor to optionally generate XHTML, that sounds like an
> additional operating mode for the Tika Extractor.  To do that, you'd need to add a
> flag, probably to the Output Specification, with the associated UI components, and
> be sure to maintain backwards compatibility.
> 
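A rough sketch of what a backwards-compatible flag could look like. Everything here is hypothetical: the parameter name `outputformat`, the `OutputMode` enum, and the use of a plain `Map` as a stand-in for the stored specification are assumptions for illustration and not part of the ManifoldCF API. The only point is that a specification saved before the flag existed has no value for it and must keep producing plain text.

```java
import java.util.Map;

class OutputModeDemo {
    // Hypothetical operating modes for the extractor.
    enum OutputMode { TEXT, XHTML }

    // Hypothetical parameter name; not an existing ManifoldCF setting.
    static final String PARAM_OUTPUT_FORMAT = "outputformat";

    // Resolve the mode from a stored specification (modelled as a Map here).
    // Missing or unrecognized values fall back to TEXT, so jobs configured
    // before the flag was introduced behave exactly as before.
    static OutputMode resolve(Map<String, String> spec) {
        String value = spec.get(PARAM_OUTPUT_FORMAT);
        if (value != null && value.trim().equalsIgnoreCase("xhtml")) {
            return OutputMode.XHTML;
        }
        return OutputMode.TEXT;
    }

    public static void main(String[] args) {
        System.out.println(resolve(Map.of()));                              // old job, no flag
        System.out.println(resolve(Map.of(PARAM_OUTPUT_FORMAT, "xhtml")));  // new job, flag set
    }
}
```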
> Karl
> 
> 
>> On Thu, Jan 2, 2020 at 4:30 PM Jörn Franke <jornfranke@gmail.com> wrote:
>> Hi,
>> 
>> Is it possible to get (X)HTML output from the Tika Extractor (the ManifoldCF
>> version, not the extract handler) instead of plain text? How to achieve this in
>> Tika itself is pretty clear:
>> https://tika.apache.org/1.8/examples.html#Picking_different_output_formats
>> 
>> Reason: We need to extract very specific chapters from a Word document and index
>> them as dedicated Solr documents (the latter part probably still has to be done in
>> an update chain).  We currently already extract the (sub-)chapters we need from the
>> HTML version of the Word document that Tika creates.
>> 
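The chapter extraction described above could be sketched roughly as follows, using only the JDK's XML parser. This assumes the XHTML is well-formed, that every chapter starts with an `h1` element directly under `body`, and that the chapter title is the heading text; the class and method names are hypothetical, and real Word-derived output may need different heading levels or nesting rules.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ChapterSplitDemo {
    // Splits well-formed XHTML into chapters keyed by <h1> text.
    // Assumption: headings and paragraphs are direct children of <body>.
    static Map<String, String> splitByH1(String xhtml) {
        Map<String, String> chapters = new LinkedHashMap<>();
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
        Node body = doc.getElementsByTagName("body").item(0);
        if (body == null) {
            return chapters;
        }
        String currentTitle = null;
        StringBuilder text = new StringBuilder();
        NodeList children = body.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace between elements
            }
            if ("h1".equals(n.getNodeName())) {
                // A new heading closes the previous chapter, if any.
                if (currentTitle != null) {
                    chapters.put(currentTitle, text.toString().trim());
                }
                currentTitle = n.getTextContent().trim();
                text.setLength(0);
            } else if (currentTitle != null) {
                text.append(n.getTextContent().trim()).append('\n');
            }
        }
        if (currentTitle != null) {
            chapters.put(currentTitle, text.toString().trim());
        }
        return chapters;
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<h1>Intro</h1><p>A</p><p>B</p><h1>Scope</h1><p>C</p>"
                + "</body></html>";
        // Each map entry could then become one dedicated Solr document.
        System.out.println(splitByH1(xhtml).keySet());
    }
}
```

This only works when the structure survives extraction, which is why plain-text output is not sufficient for this use case.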
>> Thank you.
>> 
>> Best regards
