tika-dev mailing list archives

From "David Eric Pugh (Jira)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2970) Configuring Tesseract for OCR of PDF via Tika Config is not working
Date Sun, 20 Oct 2019 20:38:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955615#comment-16955615 ]

David Eric Pugh commented on TIKA-2970:
---------------------------------------

Interestingly, I think this might all work in Tika Server mode. There is a method,
fillParseContext, that does populate the `parseContext` with the configured files:

https://github.com/apache/tika/blob/master/tika-server/src/main/java/org/apache/tika/server/resource/TikaResource.java#L172

I don't think there is anything like it in the call path from TikaApp.
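
To illustrate the suspected difference, here is a minimal sketch of a type-keyed context lookup with a default fallback. The classes `Context` and `OcrConfig` and the method `effectivePath` are hypothetical stand-ins written for this example, not Tika's own `ParseContext` or `TesseractOCRConfig`, and the path value is made up. It only shows why a lookup falls back to the default when nothing has put the customized config into the context first:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for illustration only; these are NOT Tika's classes.
class ContextLookupSketch {

    /** Hypothetical config holder, loosely modelled on an OCR config object. */
    static final class OcrConfig {
        final String tesseractPath;

        OcrConfig(String tesseractPath) {
            this.tesseractPath = tesseractPath;
        }
    }

    /** Minimal type-keyed context with a get(Class, default) style lookup. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            // Fall back to the default when nothing was registered for this type.
            return value != null ? key.cast(value) : defaultValue;
        }
    }

    /** Assumed default: an empty path, meaning "nothing customized". */
    static final OcrConfig DEFAULT_CONFIG = new OcrConfig("");

    /** Returns the path a parser would see, with or without the populate step. */
    static String effectivePath(OcrConfig configured, boolean populateContext) {
        Context context = new Context();
        if (populateContext) {
            // The step the server code path appears to perform,
            // and the app code path appears to skip.
            context.set(OcrConfig.class, configured);
        }
        return context.get(OcrConfig.class, DEFAULT_CONFIG).tesseractPath;
    }

    public static void main(String[] args) {
        OcrConfig configured = new OcrConfig("/opt/tesseract/bin");
        System.out.println("populated:     " + effectivePath(configured, true));
        System.out.println("not populated: " + effectivePath(configured, false));
    }
}
```

If the TikaApp path really does skip the populate step, the second call is the situation described in the issue: the customized object exists, but the lookup never sees it.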

> Configuring Tesseract for OCR of PDF via Tika Config is not working
> -------------------------------------------------------------------
>
>                 Key: TIKA-2970
>                 URL: https://issues.apache.org/jira/browse/TIKA-2970
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr
>    Affects Versions: 1.22
>            Reporter: David Eric Pugh
>            Priority: Critical
>
> Based on TIKA-2705, I thought I could eliminate the use of the properties files for configuring
> PDF and OCR processing, and just use a tika-config.xml file.
> I believe I have a unit test that demonstrates that if you need to override the Tesseract
> path for OCR, you always end up with the default Tesseract configuration, which leads to Tika
> throwing an error: https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java#L328
> In stepping through the code, it seems like every time we consult the context:
> ```
> TesseractOCRConfig tesseractConfig =
>                 context.get(TesseractOCRConfig.class, DEFAULT_TESSERACT_CONFIG);
> ```
> We always get back the default. The context never has our customized TesseractOCRConfig,
> despite the fact that when we load up the TikaConfig in the first case, I notice that we
> do create a TesseractOCRParser object WITH the various parameters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
