manifoldcf-user mailing list archives

From Konrad Holl <KH...@searchtechnologies.com>
Subject [Tika content extraction Content Transformation Component] Additional Options
Date Wed, 09 Mar 2016 14:45:18 GMT
Hi,

For a client project I needed to enable OCR for images embedded in PDFs. Unfortunately, ManifoldCF
does not expose any configuration options for this. It would be nice to have the following options
in the Tika content extraction transformation connector:


1. Enable PDF image extraction for OCR: https://tika.apache.org/1.7/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages%28boolean%29

2. Set the default language for Tesseract: https://tika.apache.org/1.7/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setLanguage%28java.lang.String%29
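
For reference, a minimal sketch of how the two options listed above can be set when calling Tika 1.x directly, outside ManifoldCF. The class name PdfOcrSketch, the language code "deu", and the use of a file path from the command line are assumptions for illustration only. It also assumes Tesseract is installed and on the PATH. This is not how the ManifoldCF connector is implemented; it only shows where such settings would end up on the Tika side.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical class name; requires the Tika parsers on the classpath.
public class PdfOcrSketch {
    public static void main(String[] args) throws Exception {
        // Option 1: extract inline images from PDFs so they can be passed to OCR.
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        // Option 2: set the Tesseract language ("deu" is an assumed example value).
        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        ocrConfig.setLanguage("deu");

        Parser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        context.set(TesseractOCRConfig.class, ocrConfig);
        // Registering the parser lets embedded documents (the extracted images)
        // be parsed recursively.
        context.set(Parser.class, parser);

        // A write limit of -1 disables the default character limit.
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(in, handler, new Metadata(), context);
        }
        System.out.println(handler.toString());
    }
}
```

In ManifoldCF, the two values (a boolean and a language string) would presumably come from the connector's configuration rather than being hard-coded as they are here.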

Thanks

-Konrad

KONRAD HOLL
Senior Technical Consultant

M +49 178 8855 553
F  +49 178 99 8855 553
Skype: konrad.holl

Search Technologies GmbH
Theodor-Heuss-Allee 112
60486 Frankfurt am Main

SEARCH TECHNOLOGIES
Find Better Answers.
www.searchtechnologies.com

