tika-dev mailing list archives

From "Grant Ingersoll (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-93) OCR support
Date Sat, 08 Feb 2014 00:01:25 GMT

    [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895276#comment-13895276 ]

Grant Ingersoll commented on TIKA-93:
-------------------------------------

Well, Tesseract is out, at least as far as using Tess4J goes, since Tess4J has LGPL and BCL
dependencies. Ugh, especially since Tesseract itself is ASL. And here Tesseract looks so
promising, at least in my initial tests (compared to JavaOCR, which requires a bunch of
training work up front).

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> I don't know of any decent open source pure Java OCR libraries, but there are command
> line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked
> by Tika to extract text content (where available) from image files.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
