tika-dev mailing list archives

From "Matthew Caruana Galizia (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2175) Enable extraction of inlined jp2/jpx from PDF
Date Fri, 25 Nov 2016 17:42:58 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696377#comment-15696377 ]

Matthew Caruana Galizia commented on TIKA-2175:

Still no joy, both with my bridge classes and with tika-app from trunk. It seems the images
in the PDF are skipped over entirely; I don't think the embedded document parsing handler
is ever even invoked. I've attached the PDF in question. If you open it in a hex editor, you
can see that the images are declared to be in "jp2" format.
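
For anyone who wants to repeat the hex editor check without opening one, here is a minimal
sketch in Python. It is not part of Tika or of the attached patch; the function name
find_jp2_markers is made up for illustration. It assumes that what shows up in the raw bytes
is either the PDF filter name /JPXDecode or the 12-byte JPEG 2000 signature box, and it only
sees what is stored uncompressed in the file, so image dictionaries packed into compressed
object streams would be missed.

```python
import re

# 12-byte JPEG 2000 signature box that opens a .jp2/.jpx file.
JP2_SIGNATURE = b"\x00\x00\x00\x0cjP  \r\n\x87\n"
# PDF filter name used for JPEG 2000 image streams.
JPX_FILTER = re.compile(rb"/JPXDecode\b")


def find_jp2_markers(data: bytes) -> dict:
    """Return byte offsets of JPEG 2000 hints found in raw PDF bytes.

    Hypothetical helper for illustration only: it does a plain byte scan
    and does not parse the PDF object structure.
    """
    # Offsets of every uncompressed "/JPXDecode" name in the file.
    filters = [m.start() for m in JPX_FILTER.finditer(data)]

    # Offsets of every JP2 signature box (start of an embedded jp2/jpx file).
    signatures = []
    pos = data.find(JP2_SIGNATURE)
    while pos != -1:
        signatures.append(pos)
        pos = data.find(JP2_SIGNATURE, pos + 1)

    return {"jpx_filters": filters, "jp2_signatures": signatures}
```

To try it, read the attachment as bytes, for example
find_jp2_markers(open("pdf-with-jp2-images.pdf", "rb").read()), and look at whether either
list is non-empty. An empty result does not prove the PDF has no JPEG 2000 images, for the
reason given above.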

> Enable extraction of inlined jp2/jpx from PDF
> ---------------------------------------------
>                 Key: TIKA-2175
>                 URL: https://issues.apache.org/jira/browse/TIKA-2175
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>         Attachments: pdf-with-jp2-images.pdf
> On TIKA-2174, [~mcaruanagalizia] reported that inline jp2 images in PDFs were not being
> OCR'd. TIKA-2174 added that file type to our tesseract parser, but our code in the PDFParser
> wasn't extracting the inline images as well. Let's fix that.

This message was sent by Atlassian JIRA
