tika-dev mailing list archives

From Eric Pugh <ep...@opensourceconnections.com>
Subject Re: TesseractOCRParserTest needed extra parameters to run...
Date Tue, 20 Aug 2019 19:22:37 GMT
I poked around at other Tika parsers that require additional installation steps, such as
the GrobidNERecogniser class, to see how they warn the user. It turns out this is handled
by NOT having a unit test at all ;-(
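
One possible alternative to hardcoding the path in the test, sketched here under assumptions: resolve the Tesseract directory from a system property or an environment variable, and build an explicit message to show when the binary cannot be found. The class name TesseractPathResolver, the property name tesseract.path, and the variable name TESSERACT_PATH are all hypothetical; none of them exist in Tika.

```java
import java.io.File;
import java.util.Optional;

/** Hypothetical helper for illustration only; not part of Tika. */
final class TesseractPathResolver {

    // Hypothetical names chosen for this sketch.
    static final String PATH_PROPERTY = "tesseract.path";
    static final String PATH_ENV = "TESSERACT_PATH";

    private TesseractPathResolver() {
    }

    /** Returns the first non-blank candidate: the system property wins over the environment variable. */
    static Optional<String> choose(String fromProperty, String fromEnv) {
        if (fromProperty != null && !fromProperty.trim().isEmpty()) {
            return Optional.of(fromProperty.trim());
        }
        if (fromEnv != null && !fromEnv.trim().isEmpty()) {
            return Optional.of(fromEnv.trim());
        }
        // Nothing configured: the caller falls back to whatever is on the PATH.
        return Optional.empty();
    }

    /** Reads the configured directory from the JVM property or the process environment. */
    static Optional<String> configuredDir() {
        return choose(System.getProperty(PATH_PROPERTY), System.getenv(PATH_ENV));
    }

    /** True if the directory contains an executable file named tesseract (or tesseract.exe). */
    static boolean hasTesseract(String dir) {
        File unix = new File(dir, "tesseract");
        File windows = new File(dir, "tesseract.exe");
        return (unix.isFile() && unix.canExecute())
                || (windows.isFile() && windows.canExecute());
    }

    /** Builds the text to show when the OCR tests are skipped. */
    static String skipMessage(Optional<String> dir) {
        String where = dir.map(d -> "in '" + d + "'").orElse("on the default PATH");
        return "Tesseract not found " + where + "; skipping OCR tests. Set -D"
                + PATH_PROPERTY + "=<dir> or " + PATH_ENV
                + "=<dir> to point at the install directory.";
    }
}
```

In the test, the resolved directory could be passed to config.setTesseractPath(...), the setter used in the quoted diff, and the message could be handed to JUnit 4's Assume.assumeTrue(String, boolean) so that a skipped run explains itself instead of passing silently.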

 

> On Aug 20, 2019, at 10:46 AM, Eric Pugh <epugh@opensourceconnections.com> wrote:
> 
> In order to get the TesseractOCRParserTest to run, having installed Tesseract on OSX
> using “brew install tesseract”, I had to be explicit about the paths.
> 
> Any thoughts on how we could convey to a user that they might need to tweak the path
> to run the unit tests?  I was thinking about adding some sort of messaging, but I don’t
> know if that is a pattern that we have in Tika with these external dependencies?
> 
> Thoughts?
> 
> diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
> index 9ebcee068..32db2c442 100644
> --- a/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
> +++ b/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
> @@ -51,6 +51,7 @@ public class TesseractOCRParserTest extends TikaTest {
>  
>      public static boolean canRun() {
>          TesseractOCRConfig config = new TesseractOCRConfig();
> +        config.setTesseractPath("/usr/local/bin");
>          TesseractOCRParserTest tesseractOCRTest = new TesseractOCRParserTest();
>          return tesseractOCRTest.canRun(config);
>      }
> @@ -164,6 +165,8 @@ public class TesseractOCRParserTest extends TikaTest {
>                            BasicContentHandlerFactory.HANDLER_TYPE handlerType,
>                            TesseractOCRConfig.OUTPUT_TYPE outputType) throws Exception {
>          TesseractOCRConfig config = new TesseractOCRConfig();
> +        config.setTesseractPath("/usr/local/bin");
> +        config.setTessdataPath("/usr/local/Cellar/tesseract/4.1.0/share/tessdata");
>          config.setOutputType(outputType);
>          
>          Parser parser = new RecursiveParserWrapper(new AutoDetectParser(),
> _______________________
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com
> <http://www.opensourceconnections.com/> | My Free/Busy <http://tinyurl.com/eric-cal>
> 
> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
> 
> This e-mail and all contents, including attachments, is considered to be Company Confidential
> unless explicitly stated otherwise, regardless of whether attachments are marked as such.
> 


