tika-dev mailing list archives

From "Markus Mandalka (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2749) OCR on PDFs should "just work" out of the box
Date Wed, 02 Jan 2019 13:19:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732043#comment-16732043 ]

Markus Mandalka commented on TIKA-2749:

Another nice feature would be caching the OCR results of images, so that identical images
that appear in many documents (logos, for example) are not OCR'd multiple times. I already do
this with an Open Semantic ETL option that writes plain-text cache files named by the hash of
the image; in the future the cache might live in a Solr index instead.
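
A minimal sketch of the hash-keyed cache idea, for illustration only. It is not
the Open Semantic ETL implementation or a Tika API: the function name cached_ocr,
the choice of SHA-256, the ".txt" file layout, and the run_ocr callable (standing
in for whatever OCR engine is used) are all assumptions.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def cached_ocr(image_bytes: bytes, cache_dir: Path,
               run_ocr: Callable[[bytes], str]) -> str:
    """Return OCR text for an image, reusing a plain-text cache file if present.

    run_ocr is a hypothetical stand-in for the real OCR call.
    """
    # Identical image bytes always produce the same digest, so a logo that
    # appears in many documents maps to a single cache file.
    digest = hashlib.sha256(image_bytes).hexdigest()
    cache_file = cache_dir / f"{digest}.txt"

    if cache_file.exists():
        # Cache hit: skip OCR entirely.
        return cache_file.read_text(encoding="utf-8")

    # Cache miss: run OCR once and store the result for later documents.
    text = run_ocr(image_bytes)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text, encoding="utf-8")
    return text
```

Calling cached_ocr twice with the same bytes should invoke run_ocr only on the
first call. Note that only byte-identical images share a cache entry; a re-encoded
or resized copy of the same logo would hash differently and be OCR'd again.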

> OCR on PDFs should "just work" out of the box
> ---------------------------------------------
>                 Key: TIKA-2749
>                 URL: https://issues.apache.org/jira/browse/TIKA-2749
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> There are now two different ways (with various parameters) to trigger OCR on inline images
> within PDFs.  The user has to 1) understand that these are available and then 2) elect to
> turn one of those on.
> I think we should make OCR'ing on PDFs "just work" perhaps with a hybrid strategy between
> the 2 options.  Users should still be allowed to configure as they wish, of course.

This message was sent by Atlassian JIRA