tika-dev mailing list archives

From "Grant Ingersoll (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-93) OCR support
Date Sun, 09 Feb 2014 13:48:19 GMT

     [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated TIKA-93:

    Attachment: testOCR.pptx

Not sure if this is progress or not...  

The testOCR.* files need to go in the parsers/src/test/resources/test-documents directory.

Things that changed:
# Moved config to ParseContext instead of the one-off implementation in PDFParserConfig.
# Used the existing ParseContext to pass in the OCRParser instead of handling it separately.
# Added some more test files.  Will upload them.

Things I could use help on:
# Trying to get this integrated into the Office stuff.  I see the DELEGATING_PARSER capabilities
for embedded extraction, but I'm not quite sure how best to leverage them.  See JavaOCRParserTest.testOCR
for some attempts at setting up the test.
# Overall, my biggest gap in understanding is how to configure this stuff.  As I see
it, we need to be able to set two things:
## The OCRParser or DelegatingParser.  I'm not sure how embedded contexts are used in practice.
Note that some of the OCRParser implementations will require configuration/training before they
can be used.
## Whether or not to actually use the OCRParser (a boolean flag), as OCR is expensive and
not everyone will want it for every doc.
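
To make the two settings above concrete, here is a minimal sketch of how both could travel through one class-keyed context.  Everything in it is an assumption for illustration: SketchContext is a stand-in written in the spirit of ParseContext's class-keyed set/get, and OcrEngine, OcrSettings and OcrSketch are hypothetical names, not classes from the patch or from Tika.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a class-keyed context (modelled loosely on ParseContext).
class SketchContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    // Register a value under its class.
    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    // Look up a value by class; returns null when nothing was registered.
    <T> T get(Class<T> key) {
        return key.cast(entries.get(key));
    }
}

// Hypothetical OCR abstraction; a real engine may need configuration/training first.
interface OcrEngine {
    String recognize(byte[] image);
}

// Hypothetical settings object carrying the on/off flag.
class OcrSettings {
    final boolean enabled;

    OcrSettings(boolean enabled) {
        this.enabled = enabled;
    }
}

class OcrSketch {
    // Runs OCR only when an engine is registered AND the flag is switched on;
    // otherwise returns an empty string so callers can skip the expensive step.
    static String extractImageText(byte[] image, SketchContext context) {
        OcrEngine engine = context.get(OcrEngine.class);
        OcrSettings settings = context.get(OcrSettings.class);
        if (engine == null || settings == null || !settings.enabled) {
            return "";
        }
        return engine.recognize(image);
    }
}
```

With this shape, OCR stays off unless the caller registers both an engine and an enabled settings object, which matches the concern that not everyone will want OCR for every doc.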

> OCR support
> -----------
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
>
> I don't know of any decent open source pure Java OCR libraries, but there are command
> line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked
> by Tika to extract text content (where available) from image files.

This message was sent by Atlassian JIRA
