tika-dev mailing list archives

From "Petr Vas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-93) OCR support
Date Tue, 19 Aug 2014 11:17:20 GMT

    [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102124#comment-14102124 ]

Petr Vas commented on TIKA-93:

OK, I have managed to get TesseractOCRParser working through tika-server with a custom tika-config.xml.

The only thing I had to change in the code was the initialization of TesseractOCRConfig
in the parse method (line 114 in TesseractOCRParser.java). Instead of returning when it gets
a null config from ParseContext, it now falls back to new TesseractOCRConfig(). Line 114 in TesseractOCRParser.java
now looks like this:
{code:java}    		config = new TesseractOCRConfig();{code}
This made one of the tests fail (testPPTXThumbnail in OOXMLParserTest), so this change
should not be merged into the main line as is, but it fits perfectly for my own purposes.
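
To make the change above easier to follow, here is a minimal, self-contained sketch of the
fallback pattern. It does not use the real Tika classes: OcrConfig and Context are hypothetical
stand-ins for TesseractOCRConfig and ParseContext, and the "eng" default is only an illustrative
value.

```java
import java.util.HashMap;
import java.util.Map;

final class OcrConfigFallback {

    /** Hypothetical stand-in for TesseractOCRConfig; it only holds a language code. */
    static final class OcrConfig {
        final String language;

        OcrConfig() {
            this("eng"); // illustrative default, not taken from the Tika sources
        }

        OcrConfig(String language) {
            this.language = language;
        }
    }

    /** Hypothetical stand-in for ParseContext: objects stored under their class. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            // Class.cast passes null through, so a missing entry yields null.
            return key.cast(entries.get(key));
        }
    }

    /** Returns the config found in the context, or a default one if none was set. */
    static OcrConfig resolveConfig(Context context) {
        OcrConfig config = context.get(OcrConfig.class);
        if (config == null) {
            // Before the change the parser returned here; now it falls back to defaults.
            config = new OcrConfig();
        }
        return config;
    }
}
```

With this fallback a caller that never sets a config still gets OCR with default settings, which
may be why a test that expected no OCR output started to fail.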

I have also managed to use both the PDFBox and TesseractOCRParser parsers for PDFs by
disabling magic detection and binding PDFs that are to be OCRed to a specific MIME type (application/pdf-ocr).
I can share my tika-config.xml in case this is of interest. I can see that work is being
done on seamless integration between PDFBox and Tesseract as part of GSoC 2014
(PDFBOX-1912), but it is not finished and it is not clear whether it ever will be.
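
The routing idea can be sketched roughly as follows. This is not my tika-config.xml and not
Tika's own dispatch code; it is only a hypothetical lookup table, with parser names as plain
strings, showing that once magic detection is off the declared MIME type alone selects the parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MimeRouting {

    // Illustrative table only: the declared MIME type is mapped to a parser name.
    private static final Map<String, String> PARSER_BY_TYPE = new LinkedHashMap<>();

    static {
        PARSER_BY_TYPE.put("application/pdf", "PDFParser");              // regular text extraction
        PARSER_BY_TYPE.put("application/pdf-ocr", "TesseractOCRParser"); // custom type for PDFs to OCR
    }

    /** Returns the parser name bound to the declared type, or null if nothing is bound. */
    static String parserFor(String declaredType) {
        return PARSER_BY_TYPE.get(declaredType);
    }
}
```

The client therefore decides per request whether a PDF is OCRed, simply by choosing which
content type it declares.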

In general, I am wondering how I can define a ParseContext in tika-server, so that I can
skip hacking the code and make Tesseract configurable outside of the source code. Any ideas/pointers?

> OCR support
> -----------
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.7
>         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch,
> TesseractOCRParser.patch, TesseractOCR_Tyler.patch, TesseractOCR_Tyler_v2.patch, testOCR.docx,
> testOCR.pdf, testOCR.pptx
>
> I don't know of any decent open source pure Java OCR libraries, but there are command
> line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked
> by Tika to extract text content (where available) from image files.

This message was sent by Atlassian JIRA
