tika-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-93) OCR support
Date Sat, 08 Feb 2014 19:25:26 GMT

    [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895698#comment-13895698 ]

Chris A. Mattmann commented on TIKA-93:
---------------------------------------

Hey Grant, the patch is looking good! I still need to download it and test it out, so this
is based only on a cursory inspection.
Some comments:
# What is the jacoco dependency in tika-parent for? It seems orthogonal to the patch.
# Maybe think about providing the training directory as part of the ParseContext (maybe as a
property like o.a.tika.parser.ocr.trainingDataDirPath?).
# The patch depends on a custom external Maven repo, myGrid. Is there any way to get the jar
from the Central repo instead? We have made an effort in Tika to remove any specific deps on
external repositories, see: http://blog.sonatype.com/2010/03/why-external-repos-are-being-phased-out-of-central/#.UvaEN0JdWxU
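
To make the second comment concrete, here is a minimal sketch of passing the training
directory through a class-keyed context. ContextSketch is a hypothetical stand-in modeled
loosely on ParseContext's set/get-by-class pattern, and OcrSettingsSketch, its field, and the
default "tessdata" path are illustrative assumptions, not names from the patch.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a class-keyed context such as Tika's ParseContext.
final class ContextSketch {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    // Registers a value under its class, replacing any earlier value.
    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    // Returns the registered value, or defaultValue when none was set.
    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value == null ? defaultValue : key.cast(value);
    }
}

// Hypothetical OCR settings holder; the name and field are assumptions.
final class OcrSettingsSketch {
    private final Path trainingDataDir;

    OcrSettingsSketch(Path trainingDataDir) {
        this.trainingDataDir = trainingDataDir;
    }

    Path getTrainingDataDir() {
        return trainingDataDir;
    }
}

final class OcrContextDemo {
    // Assumed fallback used when the caller supplies no settings.
    static final OcrSettingsSketch DEFAULTS =
            new OcrSettingsSketch(Paths.get("tessdata"));

    // What an OCR parser could do at parse time to find its training data.
    static Path resolveTrainingDir(ContextSketch context) {
        return context.get(OcrSettingsSketch.class, DEFAULTS).getTrainingDataDir();
    }

    public static void main(String[] args) {
        ContextSketch context = new ContextSketch();
        System.out.println(resolveTrainingDir(context)); // falls back to the default

        context.set(OcrSettingsSketch.class,
                new OcrSettingsSketch(Paths.get("/opt/ocr/training")));
        System.out.println(resolveTrainingDir(context)); // caller-supplied directory
    }
}
```

The appeal of this shape is that callers who do not care about OCR pass nothing and get the
default, while callers who do can override the directory per parse call without touching
global configuration.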

Looking great. Maybe we can get some of this into 1.6 even with the deps on the external repo,
but we need to get rid of those before releasing. I will try this out in a few hours! I'm
excited b/c I may even be able to use this for the homework assignments in my CS572 class
on Search Engines, where we look at FBI Vault PDF files! :) http://www-scf.usc.edu/~csci572/


> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: TIKA-93.patch, TIKA-93.patch
>
>
> I don't know of any decent open source pure Java OCR libraries, but there are command
> line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked
> by Tika to extract text content (where available) from image files.
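
As a rough illustration of the idea in the issue description, the sketch below prepares (but
does not start) a Tesseract command-line invocation from Java. The class and method names are
hypothetical and this is not the code in the attached patch. It assumes a `tesseract` binary
on the PATH, the usual `tesseract <image> <outputbase> -l <lang>` argument order, and the
TESSDATA_PREFIX environment variable for locating language data; how Tesseract interprets
that variable depends on the installed version.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

final class TesseractCommandSketch {
    // Prepares a process for: tesseract <image> <outputBase> -l <language>
    // Tesseract appends ".txt" to outputBase when it writes the recognized text.
    static ProcessBuilder build(File image, File outputBase,
                                String language, File tessdataPrefix) {
        List<String> command = Arrays.asList(
                "tesseract", image.getPath(), outputBase.getPath(), "-l", language);
        ProcessBuilder builder = new ProcessBuilder(command);
        if (tessdataPrefix != null) {
            // Assumption: language/training data is located via TESSDATA_PREFIX.
            builder.environment().put("TESSDATA_PREFIX", tessdataPrefix.getPath());
        }
        // Merge stderr into stdout so a caller only has one stream to drain.
        builder.redirectErrorStream(true);
        return builder;
    }

    public static void main(String[] args) {
        ProcessBuilder builder =
                build(new File("scan.png"), new File("scan-out"), "eng", null);
        // Only prints the prepared command; calling builder.start() would run it.
        System.out.println(builder.command());
    }
}
```

A caller would then start the process, wait for it to exit, and read the text from the
`.txt` file next to the output base, treating a non-zero exit code as "no text available".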



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
