tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2175) Enable extraction of inlined jp2/jpx from PDF
Date Thu, 10 Nov 2016 13:40:58 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654065#comment-15654065 ]

Tim Allison commented on TIKA-2175:

After I made the change recommended by [~tilman], I'm able to run OCR via extraction of the
inline jpx on [this file|https://t.co/yx3GRe2e6w] without the bridge classes.  

Can you try trunk against your test file(s)? My "test" with the linked file renamed to {{testOCR_jp2.pdf}}
looked like this:

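    public void testOneOff() throws Exception {
        ParseContext context = new ParseContext();
        PDFParserConfig parserConfig = new PDFParserConfig();
        // Assumption (not in the snippet as archived): inline image extraction
        // is off by default, so it is enabled here explicitly.
        parserConfig.setExtractInlineImages(true);
        context.set(PDFParserConfig.class, parserConfig);
        // Parse the PDF and its embedded images, then dump the metadata.
        debug(getRecursiveMetadata("testOCR_jp2.pdf", context));
    }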

> Enable extraction of inlined jp2/jpx from PDF
> ---------------------------------------------
>                 Key: TIKA-2175
>                 URL: https://issues.apache.org/jira/browse/TIKA-2175
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
> On TIKA-2174, [~mcaruanagalizia] reported that inline jp2 images in PDFs were not being
OCR'd.  TIKA-2174 added that file type to our tesseract parser, but our code in the PDFParser
wasn't extracting the inline images as well.  Let's fix that.

This message was sent by Atlassian JIRA
