tika-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-93) OCR support
Date Fri, 07 Feb 2014 21:43:21 GMT

    [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895083#comment-13895083 ]

Chris A. Mattmann commented on TIKA-93:
---------------------------------------

Thanks Grant, obtaining glory is win.
This still sounds like a Parser to me, but I'll be interested to see whether you whip up some
patches and what they would look like. The nice thing about Parsers is that they emit
XHTML, which you can then transform with ContentHandlers, and that is where Tika's real
pipeline capabilities are. So moving into Parser-ville gets you a pipeline effect downstream,
at least.
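
To illustrate the pipeline effect described above, here is a minimal sketch that uses only the JDK's SAX interfaces rather than Tika itself. The `emitXhtml` method is a hypothetical stand-in for an OCR parser (it is not part of Tika), and `TextCollector` is an illustrative downstream ContentHandler that keeps the text and drops the markup.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class XhtmlPipelineSketch {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /**
     * Hypothetical stand-in for an OCR parser (not a Tika API): it reports
     * already-recognized text to the handler as XHTML SAX events.
     */
    static void emitXhtml(String recognizedText, ContentHandler handler) throws SAXException {
        Attributes none = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "body", "body", none);
        handler.startElement(XHTML, "p", "p", none);
        char[] chars = recognizedText.toCharArray();
        handler.characters(chars, 0, chars.length);
        handler.endElement(XHTML, "p", "p");
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }

    /** Downstream stage: collects character data and ignores element events. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    public static void main(String[] args) throws SAXException {
        TextCollector collector = new TextCollector();
        emitXhtml("text recognized from an image", collector);
        System.out.println(collector);
    }
}
```

Because the producer only talks to the `ContentHandler` interface, any other handler (a serializer, a filter, an indexer) could be swapped in downstream without changing the producer.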

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> I don't know of any decent open source pure Java OCR libraries, but there are command
> line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked
> by Tika to extract text content (where available) from image files.
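
As a rough sketch of the idea in the issue description (invoking a command-line OCR tool from Java), the following assumes a `tesseract` executable is on the PATH and uses its basic `tesseract <image> <outputbase>` form, which writes the recognized text to `<outputbase>.txt`. The class and method names are hypothetical and this is not how Tika implements OCR.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class TesseractCliSketch {

    /** Builds the command line; assumes "tesseract" is installed and on the PATH. */
    static List<String> buildCommand(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString());
    }

    /** Runs the external tool and returns the text it wrote to outputBase + ".txt". */
    static String extractText(Path image) throws IOException, InterruptedException {
        // An empty suffix keeps the temp file name free of an extension.
        Path outputBase = Files.createTempFile("ocr", "");
        Path textFile = outputBase.resolveSibling(outputBase.getFileName() + ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(image, outputBase))
                    .inheritIO() // let the tool's messages go to our console
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(outputBase);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: TesseractCliSketch <image-file>");
            return;
        }
        System.out.println(extractText(Paths.get(args[0])));
    }
}
```

A real integration would also need to handle a missing executable, timeouts, and language selection; those are left out here to keep the sketch short.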



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
