tika-dev mailing list archives

From "Timo Boehme (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-93) OCR support
Date Mon, 10 Feb 2014 09:09:19 GMT

    [ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13896339#comment-13896339 ]

Timo Boehme commented on TIKA-93:

I would like to comment on the detection and handling of image-based PDFs, because the proposed
solution will only work for a subset of this kind of document. First, one could classify
image-based PDFs into 3 classes:
# image only (one image per page)
# image with text overlay/underlay already produced by an OCR process
# multiple images per page (instead of one full page image there are images per word/line/paragraph)

Thus, from testing only for a page-size image one does not know whether we nevertheless have parseable
text or whether we have a class 3 document (in the case of e.g. journals we might even have a full-page
background image). For automatic classification one would first need to try parsing
text in the standard way for a few pages. One should not expect image-only PDFs to contain
no text: in some cases headers, footers, or page numbers are added as text whereas the other content
is only an image. A heuristic threshold is 60-80 characters per page, below which we can
assume that we have an image PDF.
If a PDF is assumed to be an image PDF, the pages should be 'printed' into an image (in order
to also handle class 3 documents and to keep mixed data (image + text)), and this image should
be processed by OCR.
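
The threshold check described above could be sketched roughly as follows. This is only an
illustration, not code from the TIKA-93 patch: the class name ImagePdfHeuristic, the method
names, the choice to average over the sampled pages, and the choice to count only
non-whitespace characters are all assumptions. The text of the first few pages is assumed
to have been extracted already by the standard text parser.

```java
import java.util.List;

/**
 * Hypothetical helper (not part of Tika): guesses whether a PDF is image based
 * from the text that standard parsing produced for a few sampled pages.
 */
public final class ImagePdfHeuristic {

    /** Lower end of the 60-80 characters-per-page range suggested in the comment. */
    public static final int DEFAULT_MIN_CHARS_PER_PAGE = 60;

    private ImagePdfHeuristic() {
    }

    /**
     * Returns true if the average number of non-whitespace characters per
     * sampled page is below the threshold. Averaging is an assumption; the
     * comment only gives a per-page threshold.
     */
    public static boolean looksLikeImagePdf(List<String> sampledPageTexts, int minCharsPerPage) {
        if (sampledPageTexts.isEmpty()) {
            throw new IllegalArgumentException("need at least one sampled page");
        }
        long totalChars = 0;
        for (String text : sampledPageTexts) {
            totalChars += countNonWhitespace(text);
        }
        double average = (double) totalChars / sampledPageTexts.size();
        return average < minCharsPerPage;
    }

    /** Counts characters that are not whitespace, so layout padding is ignored. */
    public static int countNonWhitespace(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }
}
```

A document whose sampled pages yield only page numbers or a short header would fall below
the threshold and would then be rendered page by page and passed to OCR; a document with
ordinary body text would stay on the normal text extraction path.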


> OCR support
> -----------
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, testOCR.docx, testOCR.pdf, testOCR.pptx
>
> I don't know of any decent open source pure Java OCR libraries, but there are command
> line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked
> by Tika to extract text content (where available) from image files.

