tika-dev mailing list archives

From "Hudson (Jira)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-3002) Possible bug with OCR strategy AUTO
Date Mon, 02 Dec 2019 21:21:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-3002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986379#comment-16986379 ]

Hudson commented on TIKA-3002:
------------------------------

SUCCESS: Integrated in Jenkins build tika-branch-1x #287 (See [https://builds.apache.org/job/tika-branch-1x/287/])
TIKA-3002 -- fix bug in OCR AUTO mode (tallison: [https://github.com/apache/tika/commit/f67a83444036d4fb5b23e9000f06434bfb58eefc])
* (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java


> Possible bug with OCR strategy AUTO
> -----------------------------------
>
>                 Key: TIKA-3002
>                 URL: https://issues.apache.org/jira/browse/TIKA-3002
>             Project: Tika
>          Issue Type: Bug
>          Components: ocr, parser
>    Affects Versions: 1.22
>            Reporter: Patrick Herber
>            Priority: Major
>
> For performance reasons, I would like to run OCR only when necessary, so I set the OCR
> strategy to "AUTO".
> However, OCR is also performed on "normal" PDF files (where no OCR should be required).
> This not only slows down the application but, more importantly, duplicates the extracted
> text.
> Trying to understand how this works, I think I may have found a possible error in the
> class *AbstractPDF2XHTML*. There, when the OCR strategy AUTO is selected, line 404 checks
> the total number of characters found on the page: if it is lower than 10 (or more than 10
> characters have no Unicode mapping), OCR is performed.
> {code:java}
> } else if (config.getOcrStrategy().equals(PDFParserConfig.OCR_STRATEGY.AUTO)) {
>     //TODO add more sophistication
>     if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10) {
>         doOCROnCurrentPage();
>     }
> }
> {code}
> The logic is correct, but unfortunately at the beginning of the method (lines 361 and
> 362) the two variables checked here are reset to 0, so this condition is always true.
> I would suggest moving the reset of the two variables into a finally block at the end
> of the method.
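
The reported behavior and the suggested fix can be illustrated with a minimal, self-contained sketch. It is not the code from AbstractPDF2XHTML or from the commit referenced above: the class OcrAutoSketch, its methods, and the resetAtStart flag are hypothetical, and only the two counter names and the AUTO condition are taken from the report. It assumes the counters are filled while a page's text is extracted and checked once at the end of the page.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the per-page bookkeeping described in the report.
// NOT the real AbstractPDF2XHTML; every name except the two counters is hypothetical.
class OcrAutoSketch {
    private int totalCharsPerPage = 0;
    private int unmappedUnicodeCharsPerPage = 0;

    private final boolean resetAtStart; // true reproduces the reported behavior
    private final List<Integer> ocrPages = new ArrayList<>();

    OcrAutoSketch(boolean resetAtStart) {
        this.resetAtStart = resetAtStart;
    }

    // Called while the page's text is extracted, before endPage().
    void addText(int chars, int unmappedChars) {
        totalCharsPerPage += chars;
        unmappedUnicodeCharsPerPage += unmappedChars;
    }

    void endPage(int pageNumber) {
        if (resetAtStart) {
            // Reported bug: the counters are cleared before the AUTO check reads them.
            totalCharsPerPage = 0;
            unmappedUnicodeCharsPerPage = 0;
        }
        try {
            if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10) {
                ocrPages.add(pageNumber); // stands in for doOCROnCurrentPage()
            }
        } finally {
            // Suggested fix: reset only after the check, ready for the next page.
            totalCharsPerPage = 0;
            unmappedUnicodeCharsPerPage = 0;
        }
    }

    // Processes a text page (500 chars) and a scanned page (3 chars) and
    // returns the page numbers that were sent to OCR.
    static List<Integer> run(boolean resetAtStart) {
        OcrAutoSketch sketch = new OcrAutoSketch(resetAtStart);
        sketch.addText(500, 0);
        sketch.endPage(1);
        sketch.addText(3, 0);
        sketch.endPage(2);
        return sketch.ocrPages;
    }

    public static void main(String[] args) {
        System.out.println("reset at start:   " + run(true));  // [1, 2]
        System.out.println("reset in finally: " + run(false)); // [2]
    }
}
```

With the early reset, both pages are sent to OCR, including the page that already has 500 characters of text. With the reset only in the finally block, just the near-empty page is.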



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
