tika-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-93) OCR support
Date Sun, 23 Aug 2009 10:51:59 GMT

    [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746586#action_12746586 ]

Jukka Zitting commented on TIKA-93:

> are there any updates regarding this issue?

Not really. I've done some simple tests with ExternalParser invoking Tesseract and OCRopus,
but neither is really suited for simple out-of-the-box integration.

I also tried the commercial Asprise OCR SDK (http://asprise.com/product/ocr/index.php?lang=java)
which was much easier to set up and get reasonable results from, but obviously it's something
that we can't use in an Apache project.

If someone wants to help with this, the first step would be to come up with reasonably simple
steps to get a liberally licensed OCR engine like OCRopus installed and configured so that
you can invoke it using a simple command line like "ocr image.gif" and get the extracted text
on the standard output. It should work for at least a few simple test cases. Note that this
work should be contributed back to the upstream project.
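
To make the target concrete, here is a minimal sketch of how a caller might consume such a
command once it exists. Everything in it is an assumption for illustration: the "ocr" command
is the hypothetical wrapper described above (no such tool ships with Tesseract or OCRopus), the
class and method names are made up, the output is assumed to be UTF-8, and this is plain JDK
process handling rather than Tika's ExternalParser.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: runs an OCR command line and returns its standard output. */
public class OcrCommand {

    /**
     * Appends imagePath to the given command, runs it, and returns whatever
     * the process wrote to standard output, decoded as UTF-8 (an assumption).
     */
    public static String extractText(List<String> command, String imagePath)
            throws IOException, InterruptedException {
        List<String> argv = new ArrayList<String>(command);
        argv.add(imagePath);

        ProcessBuilder builder = new ProcessBuilder(argv);
        // Let diagnostics go to our own stderr so they cannot fill a pipe and block the child.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();
        // The command takes its input from the file argument, so close its stdin.
        process.getOutputStream().close();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("OCR command failed with exit code " + exitCode);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java OcrCommand <image file>");
            return;
        }
        // "ocr" is the hypothetical command from the text above.
        System.out.print(extractText(Arrays.asList("ocr"), args[0]));
    }
}
```

With a working wrapper on the PATH, "java OcrCommand image.gif" would print the extracted text.
The command is passed in as a list so that the same code can be pointed at a different engine,
or at a stand-in command in a test.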

Once we have something like that, we can move forward with integrating it into Tika.

> OCR support
> -----------
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> I don't know of any decent open source pure Java OCR libraries, but there are command
> line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked
> by Tika to extract text content (where available) from image files.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
