tika-dev mailing list archives

From "Matthew Caruana Galizia (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2175) Enable extraction of inlined jp2/jpx from PDF
Date Thu, 10 Nov 2016 08:38:58 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15653441#comment-15653441 ]

Matthew Caruana Galizia commented on TIKA-2175:

I've filed [an issue|https://github.com/jai-imageio/jai-imageio-jpeg2000/issues/8] with the
jpeg2000 imageio project to declare jpx support. The decoders and encoders already support that
format; the issue is simply that it isn't declared, so PDFBox doesn't find them.

As a temporary workaround and proof of concept I've added these two bridge Spi classes: https://github.com/ICIJ/extract/tree/master/src/main/java/org/icij/imageio/jpx
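
For anyone who wants to check whether a reader is visible on their classpath, here is a minimal sketch using the standard javax.imageio lookup. The class name JpxReaderCheck and the format names probed in main are my own assumptions for illustration; I have not confirmed which name PDFBox actually asks for.

```java
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import java.util.Iterator;

// Hypothetical helper class, not part of Tika, PDFBox or jai-imageio.
public class JpxReaderCheck {

    // Returns true if at least one registered ImageReader declares this format name.
    static boolean hasReader(String formatName) {
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        return readers.hasNext();
    }

    public static void main(String[] args) {
        // Format names below are assumptions; results depend on the plugins on the classpath.
        for (String name : new String[] {"jpeg2000", "jp2", "jpx"}) {
            System.out.println(name + " -> " + hasReader(name));
        }
    }
}
```

If a name prints false even with the jpeg2000 plugin on the classpath, the plugin's Spi most likely does not declare that name, which is the situation described above.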

> Enable extraction of inlined jp2/jpx from PDF
> ---------------------------------------------
>                 Key: TIKA-2175
>                 URL: https://issues.apache.org/jira/browse/TIKA-2175
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
> On TIKA-2174, [~mcaruanagalizia] reported that inline jp2 images in PDFs were not being
OCR'd.  TIKA-2174 added that file type to our Tesseract parser, but our code in the PDFParser
wasn't extracting the inline images as well.  Let's fix that.
