tika-dev mailing list archives

From Eric Pugh <ep...@opensourceconnections.com>
Subject Need some guidance on how to proceed with TIKA-2970
Date Mon, 21 Oct 2019 12:46:37 GMT
Hi all,

I need some guidance on how to drive TIKA-2970 to conclusion.   I’ve created a unit test
demonstrating that when you configure Tesseract via tika-config properties, the TesseractOCRConfig
is ignored by other Parsers when running Tika on the command line, but not when Tika is running
as a Server process ;-)

I’m hoping to get this fix in for 1.23, as it’ll make my deployment life much simpler
to have everything in one config file, and not need the separate .properties file!

1) The approach I took was somewhat mimicking the extractInlineImagesFromPDFS() method, which
was to add another check:
Is this the best way?   I feel like one of the initialization methods should have worked,
but it seemed like I never could get access to the context object to put my custom config.

2) The unit test actually runs the Tesseract process.  Any thoughts on how to improve the
unit test so that it is less of an integration test?

3) I coded against the master branch rather than branch_1x.  Is that the right way to do this?

4) Lastly, would we want to support the fillMetadata logic (https://github.com/apache/tika/blob/master/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java#L307)
in the command line version as well?   I don’t need it, and it feels like it might complicate
the parameters even more, but happy to take a stab at that if we want.
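
For readers following question 1, here is a minimal sketch of the general pattern being
discussed: a caller places a config object into a context keyed by its class, and a downstream
parser looks it up, falling back to a default when nothing was set. The OcrConfig and Context
classes below are hypothetical, simplified stand-ins written for illustration (loosely modeled
on how a parse context maps a class to an instance); they are not the Tika classes and not the
code from the patch.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextSketch {

    // Hypothetical stand-in for an OCR settings object such as TesseractOCRConfig.
    static final class OcrConfig {
        private final String language;

        OcrConfig(String language) {
            this.language = language;
        }

        String getLanguage() {
            return language;
        }
    }

    // Hypothetical, simplified parse context: a map from a class to one instance of it.
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    // What a downstream parser would do: prefer the config found in the
    // context, otherwise fall back to a built-in default.
    static String resolveLanguage(Context context) {
        OcrConfig fallback = new OcrConfig("eng");
        return context.get(OcrConfig.class, fallback).getLanguage();
    }

    public static void main(String[] args) {
        Context context = new Context();
        // Nothing set yet, so the fallback config is used.
        System.out.println(resolveLanguage(context));

        // The caller supplies its own config; the parser now picks it up.
        context.set(OcrConfig.class, new OcrConfig("deu"));
        System.out.println(resolveLanguage(context));
    }
}
```

In this sketch, a config is "ignored" in exactly the situation where the caller never puts it
into the context, so the parser only ever sees its own fallback.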



Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com
<http://www.opensourceconnections.com/> | My Free/Busy <http://tinyurl.com/eric-cal>
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>

This e-mail and all contents, including attachments, is considered to be Company Confidential
unless explicitly stated otherwise, regardless of whether attachments are marked as such.