tika-dev mailing list archives

From "Matthew Caruana Galizia (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2175) Enable extraction of inlined jp2/jpx from PDF
Date Mon, 28 Nov 2016 13:08:58 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15701919#comment-15701919 ]

Matthew Caruana Galizia commented on TIKA-2175:

The problem was OpenCL support in Tesseract. Once I rebuilt Tesseract without OpenCL support,
I got the same results as you above, but using setExtractInlineImages(true) instead of setOcrStrategy(...).
Thank you for testing.

> Enable extraction of inlined jp2/jpx from PDF
> ---------------------------------------------
>                 Key: TIKA-2175
>                 URL: https://issues.apache.org/jira/browse/TIKA-2175
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>         Attachments: pdf-with-jp2-images.pdf
> On TIKA-2174, [~mcaruanagalizia] reported that inline jp2 images in PDFs were not being
> OCR'd.  TIKA-2174 added that file type to our Tesseract parser, but our code in the PDFParser
> wasn't extracting the inline images as well.  Let's fix that.

This message was sent by Atlassian JIRA
