tika-dev mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2106) "hocr" case on Linux fails, but works on OSX. Related to TIKA-2093
Date Sat, 01 Oct 2016 03:57:20 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537818#comment-15537818 ]

Hudson commented on TIKA-2106:

SUCCESS: Integrated in Jenkins build tika-2.x #155 (See [https://builds.apache.org/job/tika-2.x/155/])
TIKA-2106 -- need to lower case hocr/txt suffix, thanks to Eric Pugh. (tallison: rev 1ab6c81cef1497e81d030d99195df1e479e0644d)
* (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java

> "hocr" case on Linux fails, but works on OSX.  Related to TIKA-2093
> -------------------------------------------------------------------
>                 Key: TIKA-2106
>                 URL: https://issues.apache.org/jira/browse/TIKA-2106
>             Project: Tika
>          Issue Type: Bug
>          Components: ocr
>         Environment: Bug in Linux, but fine in OSX.
>            Reporter: Eric Pugh
>            Assignee: Tim Allison
> We pass an output type, either TXT or HOCR, to the Tesseract command line. When we call
> the command line we lowercase it to "txt" or "hocr". However, when we read the output back in,
> we don't lowercase it. On OSX the constructed file path "output.HOCR" is still found,
> but on Linux it is not. This patch lowercases HOCR to hocr and TXT to txt in the constructed
> file path.
> I didn't write a unit test as I don't have a good Linux environment to test it in, but I was
> able to put a patched version of the Tika Parser Jar into my Docker build to verify that it works.
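
For illustration only, the fix described above can be sketched roughly as follows. This is not the actual TesseractOCRParser code: the class name OcrOutputPath and the helper expectedOutputFile are hypothetical, and the sketch assumes Tesseract writes its result to the output base name plus a lower-case suffix. The difference between platforms is presumably because the default macOS filesystem is case-insensitive, while typical Linux filesystems are case-sensitive.

```java
import java.io.File;
import java.util.Locale;

public class OcrOutputPath {

    // Hypothetical helper (not part of Tika's API): builds the path of the
    // file Tesseract is expected to have written for the given output base
    // and output type ("TXT" or "HOCR").
    public static File expectedOutputFile(File outputBase, String outputType) {
        // The command line is invoked with the lower-cased type, so the
        // suffix used to read the result back in must be lower-cased as well.
        // Locale.ROOT avoids locale-specific case mappings.
        String suffix = outputType.toLowerCase(Locale.ROOT);
        return new File(outputBase.getPath() + "." + suffix);
    }

    public static void main(String[] args) {
        File base = new File("output");
        System.out.println(expectedOutputFile(base, "HOCR").getName()); // output.hocr
        System.out.println(expectedOutputFile(base, "TXT").getName());  // output.txt
    }
}
```

With the suffix lower-cased in one place, the path that is read back matches the name passed on the command line, regardless of whether the filesystem is case-sensitive.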

This message was sent by Atlassian JIRA
