tika-dev mailing list archives

From "Grant Ingersoll (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-93) OCR support
Date Sat, 08 Feb 2014 20:17:19 GMT

    [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895718#comment-13895718 ]

Grant Ingersoll commented on TIKA-93:

bq. what is the dependency on jacoco in tika-parent? That stuff seems orthogonal to the patch.

I put that in so that I can measure whether my tests cover the code sufficiently.  I can
split it out into a separate patch.

bq. dependency on custom external Maven repo – myGrid – any way to get the jar from the
Central repo somewhere? we have made an effort in Tika to remove any specific deps on external

We could make that one optional.  All it does is add support for TIFF and a few other file
formats that aren't part of the standard ImageIO.
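
If the dependency becomes optional, the calling code needs some way to tell whether a reader
for a given format is actually available at runtime.  A minimal sketch of such a check,
using only the standard javax.imageio API; the class and method names are hypothetical and
not part of the patch:

```java
import javax.imageio.ImageIO;

// Hypothetical helper, not part of the TIKA-93 patch.
final class ImageFormatSupport {

    // True if some registered ImageIO reader claims the given format name,
    // e.g. "tiff" once an optional plugin jar is on the classpath.
    static boolean canRead(String formatName) {
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }
}
```

A parser could call ImageFormatSupport.canRead("tiff") and skip TIFF input when it returns
false, rather than failing because the optional jar is missing.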

bq.  in my CS572 class on Search Engines where we look at FBI Vault PDF files!  http://www-scf.usc.edu/~csci572/

I read the abstract for your talk, checked out the Vault, and thought it would be cool,
too.  The main issue is that JavaOCR needs to be trained in order to work with that data set.
Tesseract, on the other hand, works for it, but alas, still needs to be implemented as an
OCRParser.  Since Tess4J has some bad deps, the only ways I can see to do this are to exec
the Tesseract process or to write my own JNI integration for Tesseract.  The latter isn't
likely to happen.  The former feels less than desirable, but would work.
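
For reference, the exec approach could look roughly like the sketch below.  It assumes a
"tesseract" binary on the PATH and the usual command-line form "tesseract <image>
<outputbase> -l <lang>", which writes its text to <outputbase>.txt.  The class name, method
names, and timeout handling are hypothetical and not taken from the patch, and
ProcessBuilder.Redirect.DISCARD requires Java 9 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not part of the TIKA-93 patch.
final class TesseractExecSketch {

    // Builds: tesseract <image> <outputBase> -l <language>
    // Tesseract appends ".txt" to outputBase when writing its result.
    static List<String> buildCommand(Path image, Path outputBase, String language) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString(),
                "-l", language);
    }

    // Runs Tesseract on one image and returns the recognized text.
    static String ocr(Path image, String language, long timeoutSeconds)
            throws IOException, InterruptedException {
        // An empty suffix keeps the temp file name free of ".tmp".
        Path outputBase = Files.createTempFile("tika-ocr", "");
        Path textFile = outputBase.resolveSibling(outputBase.getFileName() + ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase, language))
                    .redirectErrorStream(true)
                    // Discard console output so a full pipe cannot block the child.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("tesseract timed out after " + timeoutSeconds + "s");
            }
            if (process.exitValue() != 0) {
                throw new IOException("tesseract exited with code " + process.exitValue());
            }
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            // Remove both the placeholder file and Tesseract's output.
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(outputBase);
        }
    }
}
```

An OCRParser implementation could write the incoming stream to a temporary image file, call
TesseractExecSketch.ocr(imageFile, "eng", 60), and hand the returned text to the content
handler.  The cost is the external binary and one process launch per image, which is the
"less than desirable" part.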

> OCR support
> -----------
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch
>
> I don't know of any decent open source pure Java OCR libraries, but there are command
> line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked
> by Tika to extract text content (where available) from image files.

This message was sent by Atlassian JIRA
