manifoldcf-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Tika Extractor - extract document as (X)HTML not as textonly
Date Thu, 02 Jan 2020 21:30:00 GMT
Hi,

Is it possible to have the Tika Extractor (the ManifoldCF version, not the
extract handler) output (X)HTML instead of plain text?
How to achieve this in Tika itself is pretty clear:
https://tika.apache.org/1.8/examples.html#Picking_different_output_formats

Reason: We need to extract very specific chapters from a Word document and
index them as dedicated Solr documents (the latter part probably still has
to be done in an update chain). There, we currently already extract the
(sub-)chapters we need from the HTML version of the Word document created
by Tika.
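
To illustrate the chapter extraction, here is a minimal sketch of the
splitting step only. It is not our actual update chain code: the class name
ChapterSplitter and the method splitByHeading are made up for this mail, and
it assumes the (X)HTML is well-formed XML without a DOCTYPE and that headings
and paragraphs are direct children of <body>. It uses only the JDK XML parser.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of ManifoldCF, Tika or Solr.
public class ChapterSplitter {

    /** Maps each heading's text to the text of the body children that follow it. */
    public static Map<String, String> splitByHeading(String xhtml, String headingTag)
            throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Reject DOCTYPE declarations so the input cannot pull in external entities.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // LinkedHashMap keeps the chapters in document order.
        Map<String, String> chapters = new LinkedHashMap<>();
        NodeList bodies = doc.getElementsByTagName("body");
        if (bodies.getLength() == 0) {
            return chapters;
        }

        String current = null;                    // title of the chapter being collected
        StringBuilder text = new StringBuilder(); // text collected for that chapter
        NodeList children = bodies.item(0).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace text nodes between elements
            }
            if (node.getNodeName().equalsIgnoreCase(headingTag)) {
                // A new heading closes the previous chapter and opens the next one.
                if (current != null) {
                    chapters.put(current, text.toString().trim());
                }
                current = node.getTextContent().trim();
                text.setLength(0);
            } else if (current != null) {
                text.append(node.getTextContent().trim()).append('\n');
            }
        }
        if (current != null) {
            chapters.put(current, text.toString().trim());
        }
        return chapters;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<h1>Scope</h1><p>First paragraph.</p><p>Second paragraph.</p>"
                + "<h1>Terms</h1><p>Definitions.</p></body></html>";
        splitByHeading(xhtml, "h1").forEach((title, body) ->
                System.out.println(title + " -> " + body.replace("\n", " | ")));
    }
}
```

Content before the first heading is dropped, and two chapters with the same
title would overwrite each other in the map, so a real implementation would
need a more robust key. Each map entry would then become one Solr document.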

Thank you.

Best regards
