tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-93) OCR support
Date Sun, 09 Feb 2014 13:27:20 GMT

    [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895897#comment-13895897 ]

Nick Burch commented on TIKA-93:
--------------------------------

Generally speaking, when a parser finds embedded resources, it calls out to the Parser set on
the context to have them processed. You could therefore set your OCR Parser there, and it'd
be called for all kinds of embedded resources. It can then OCR any suitable images it finds,
and pass everything else on to another parser (e.g. DefaultParser) to have the non-OCR-able
embedded parts handled (if required).
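
As a rough illustration of that delegation, here is a minimal, self-contained Java sketch.
It does not use Tika's real classes: Resource, SketchParser, SketchContext and the two
parser implementations are hypothetical stand-ins, with the context modelled loosely on a
class-keyed set/get lookup, and the "OCR" step is a placeholder string rather than a call
to Tesseract.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in: a resource with a media type, bytes, and embedded children. */
final class Resource {
    final String mediaType;
    final byte[] data;
    final List<Resource> embedded;

    Resource(String mediaType, byte[] data, List<Resource> embedded) {
        this.mediaType = mediaType;
        this.data = data;
        this.embedded = embedded;
    }
}

/** Hypothetical stand-in for a parser interface (not Tika's actual signature). */
interface SketchParser {
    void parse(Resource resource, List<String> out, SketchContext context);
}

/** Hypothetical stand-in for a parse context: objects looked up by class. */
final class SketchContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) { entries.put(key, value); }

    <T> T get(Class<T> key) { return key.cast(entries.get(key)); }
}

/** Emits its own text, then hands each embedded resource to the context's parser. */
final class DefaultSketchParser implements SketchParser {
    public void parse(Resource resource, List<String> out, SketchContext context) {
        out.add("text(" + resource.mediaType + ")");
        SketchParser embeddedParser = context.get(SketchParser.class);
        if (embeddedParser == null) {
            return; // nothing registered for embedded resources, so skip them
        }
        for (Resource child : resource.embedded) {
            embeddedParser.parse(child, out, context);
        }
    }
}

/** OCRs images and passes everything else on to a fallback parser. */
final class OcrSketchParser implements SketchParser {
    private final SketchParser fallback;

    OcrSketchParser(SketchParser fallback) { this.fallback = fallback; }

    public void parse(Resource resource, List<String> out, SketchContext context) {
        if (resource.mediaType.startsWith("image/")) {
            out.add("ocr(" + resource.mediaType + ")"); // placeholder for a real OCR call
        } else {
            fallback.parse(resource, out, context);
        }
    }
}

final class EmbeddedOcrDemo {
    static List<String> run() {
        Resource image = new Resource("image/png", new byte[0], List.of());
        Resource note = new Resource("text/plain", new byte[0], List.of());
        Resource doc = new Resource("application/pdf", new byte[0], List.of(image, note));

        SketchParser defaultParser = new DefaultSketchParser();
        SketchContext context = new SketchContext();
        // The OCR parser is what embedded resources get routed to.
        context.set(SketchParser.class, new OcrSketchParser(defaultParser));

        List<String> out = new ArrayList<>();
        defaultParser.parse(doc, out, context);
        return out;
    }
}
```

In this sketch the top-level document is still handled by the default parser; only the
embedded image is routed to the OCR step, and the embedded text part falls through to the
fallback.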

To handle OCRing of top-level content, e.g. images, you'd need to register your OCR parser as
the parser for those types, in place of (or possibly even wrapping) the default parser.
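
The "register for those types, possibly wrapping the default" idea might look roughly like
the following self-contained Java sketch. Again, nothing here is Tika's API: TypeRegistry,
OcrRegistration, the media type list and the string outputs are all assumptions made for
illustration, with parsers reduced to plain functions from bytes to text.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical registry mapping media types to parsers, with a default for the rest. */
final class TypeRegistry {
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();
    private final Function<byte[], String> defaultParser;

    TypeRegistry(Function<byte[], String> defaultParser) {
        this.defaultParser = defaultParser;
    }

    void register(String mediaType, Function<byte[], String> parser) {
        parsers.put(mediaType, parser);
    }

    String parse(String mediaType, byte[] data) {
        return parsers.getOrDefault(mediaType, defaultParser).apply(data);
    }
}

final class OcrRegistration {
    /** Wraps a parser: keeps its output and appends a placeholder for OCR text. */
    static Function<byte[], String> wrapWithOcr(Function<byte[], String> wrapped) {
        return data -> wrapped.apply(data) + " + [ocr text from " + data.length + " bytes]";
    }

    static TypeRegistry build() {
        Function<byte[], String> defaultParser = data -> "[default: " + data.length + " bytes]";
        TypeRegistry registry = new TypeRegistry(defaultParser);
        // Image types chosen for illustration only.
        for (String type : new String[] {"image/png", "image/jpeg", "image/tiff"}) {
            registry.register(type, wrapWithOcr(defaultParser));
        }
        return registry;
    }
}
```

Wrapping rather than replacing means whatever the default parser already extracts for an
image is kept, with the OCR text added on top; replacing would simply register the OCR
function on its own.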

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are command
> line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked
> by Tika to extract text content (where available) from image files.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
