tika-dev mailing list archives

From "Luis Filipe Nassif (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-93) OCR support
Date Fri, 21 Feb 2014 22:13:22 GMT

    [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13908877#comment-13908877 ]

Luis Filipe Nassif commented on TIKA-93:
----------------------------------------

Another approach would be to add the image and PDF types to the supportedTypes of OCRParser and
call their respective parsers from within OCRParser, instead of modifying the code of the existing parsers.


As for enabling and configuring the OCRParser, it could be registered in tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
and could be passed an OCRConfig object via the ParseContext. If OCR is not enabled, OCRParser could simply
delegate to the existing image or PDF parser.
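
To make the delegation idea concrete, here is a minimal, self-contained sketch. It deliberately does not use Tika's classes: SimpleParser, Context, OcrConfig and OcrParser below are hypothetical stand-ins written for illustration only, and runOcr is a placeholder rather than a real OCR call. It only shows the control flow: the OCR parser claims the types of its delegates, always calls the existing parser, and adds OCR output only when an enabled config is found in the context.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class OcrDelegationSketch {

    /** Hypothetical stand-in for a parser; NOT Tika's real Parser interface. */
    interface SimpleParser {
        Set<String> supportedTypes();
        String parse(String mediaType, byte[] data, Context context);
    }

    /** Hypothetical type-keyed context used to pass optional settings to parsers. */
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();
        <T> void set(Class<T> key, T value) { entries.put(key, value); }
        <T> T get(Class<T> key) { return key.cast(entries.get(key)); }
    }

    /** Hypothetical OCR settings; a missing or disabled config means "no OCR". */
    static final class OcrConfig {
        final boolean enabled;
        final String language;
        OcrConfig(boolean enabled, String language) {
            this.enabled = enabled;
            this.language = language;
        }
    }

    /** Wraps existing parsers instead of modifying them. */
    static final class OcrParser implements SimpleParser {
        private final Map<String, SimpleParser> delegates;

        OcrParser(Map<String, SimpleParser> delegates) {
            this.delegates = delegates;
        }

        @Override
        public Set<String> supportedTypes() {
            // Claims exactly the types that the wrapped parsers handle.
            return delegates.keySet();
        }

        @Override
        public String parse(String mediaType, byte[] data, Context context) {
            SimpleParser delegate = delegates.get(mediaType);
            if (delegate == null) {
                throw new IllegalArgumentException("Unsupported type: " + mediaType);
            }
            // The existing parser always runs first.
            String text = delegate.parse(mediaType, data, context);
            OcrConfig config = context.get(OcrConfig.class);
            if (config == null || !config.enabled) {
                return text; // OCR not enabled: behave like the existing parser
            }
            return text + runOcr(data, config);
        }

        /** Placeholder only; a real implementation would invoke an OCR engine here. */
        String runOcr(byte[] data, OcrConfig config) {
            return "[ocr:" + config.language + ", " + data.length + " bytes]";
        }
    }

    /** Builds a trivial delegate that returns fixed text for one media type. */
    static SimpleParser stub(final String type, final String output) {
        return new SimpleParser() {
            @Override
            public Set<String> supportedTypes() { return Collections.singleton(type); }
            @Override
            public String parse(String mediaType, byte[] data, Context context) { return output; }
        };
    }

    public static void main(String[] args) {
        Map<String, SimpleParser> delegates = new HashMap<>();
        delegates.put("image/tiff", stub("image/tiff", "metadata-only"));
        delegates.put("application/pdf", stub("application/pdf", "pdf-text"));
        OcrParser parser = new OcrParser(delegates);

        Context context = new Context();
        System.out.println(parser.parse("image/tiff", new byte[3], context));

        context.set(OcrConfig.class, new OcrConfig(true, "eng"));
        System.out.println(parser.parse("image/tiff", new byte[3], context));
    }
}
```

The point of this shape is that the existing image and PDF parsers stay untouched, and callers who never put a config into the context see no change in behaviour.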

I agree with Timo that it would be better to render the PDF pages to images rather than iterate over
the PDF's objects.

Finally, Tesseract already supports TIFF files.
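
Because of that, a TIFF could in principle be handed to the Tesseract command line tool without any conversion step. The sketch below assumes a `tesseract` binary on the PATH and uses the usual `tesseract <image> <outputbase> [-l <lang>]` invocation, which writes the recognized text to `<outputbase>.txt`. The class and method names are made up for illustration, and whether a particular TIFF is readable depends on how the local Tesseract was built.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class TesseractCommandSketch {

    /** Builds the argument list: tesseract <image> <outputBase> [-l <language>]. */
    static List<String> buildCommand(String imagePath, String outputBase, String language) {
        List<String> command = new ArrayList<>();
        command.add("tesseract"); // assumption: the binary is on the PATH
        command.add(imagePath);
        command.add(outputBase);  // Tesseract appends ".txt" to this base name
        if (language != null && !language.isEmpty()) {
            command.add("-l");
            command.add(language);
        }
        return command;
    }

    /** Runs the command and returns its exit code; fails if Tesseract is not installed. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command, so the sketch runs without Tesseract installed.
        List<String> command = buildCommand("scan.tif", "scan-out", "eng");
        System.out.println(String.join(" ", command));
    }
}
```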

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.6
>
>         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are command
> line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked
> by Tika to extract text content (where available) from image files.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
