tika-dev mailing list archives

From "Loris Bachert (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1729) OCR in PDF files
Date Thu, 03 Sep 2015 13:25:47 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729032#comment-14729032 ]

Loris Bachert commented on TIKA-1729:
-------------------------------------

Unfortunately, I already tried this solution and it does not work either. Do you have any other
ideas for how I can fix this problem?

> OCR in PDF files
> ----------------
>
>                 Key: TIKA-1729
>                 URL: https://issues.apache.org/jira/browse/TIKA-1729
>             Project: Tika
>          Issue Type: Improvement
>          Components: config, parser
>    Affects Versions: 1.9, 1.10
>         Environment: Windows 7, 64-bit, JDK 1.8.0_51 64-bit
> Windows 10, 64-bit, JDK 1.8.0_51 32-bit
>            Reporter: Loris Bachert
>              Labels: java, ocr, parser, pdf
>
> As described in this [Stack Overflow post|http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files],
> I'm having trouble extracting text from scanned PDF files. By scanned PDF files I mean
> PDF files that consist only of images. Because each page is an image, I can't extract them
> using a custom ParsingEmbeddedDocumentExtractor. I also tried the setExtractInlineImages
> method of PDFParserConfig, but that didn't work either.
> There is already a [ticket|https://issues.apache.org/jira/browse/TIKA-93] regarding
> OCR support, which includes the [PDF file|https://issues.apache.org/jira/secure/attachment/12627866/testOCR.pdf]
> I'm using for my tests.
> Here is a JUnit test that reproduces my issue:
> {code:title=PDFOCRTest.java|borderStyle=solid}
> @Test
> public void testPDFOCRExtraction() throws IOException, SAXException, TikaException {
> 	// Ask the PDF parser to extract inline images.
> 	PDFParserConfig config = new PDFParserConfig();
> 	config.setExtractInlineImages(true);
> 	ParseContext context = new ParseContext();
> 	context.set(PDFParserConfig.class, config);
> 	
> 	BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> 	Metadata metadata = new Metadata();
> 	PDFParser pdfParser = new PDFParser();
> 	pdfParser.setPDFParserConfig(config);
> 	
> 	// filePath is declared elsewhere (not shown in this report).
> 	// try-with-resources closes the stream even if parsing throws.
> 	try (InputStream stream = new FileInputStream(new File(filePath))) {
> 		pdfParser.parse(stream, handler, metadata, context);
> 	}
> 	// Expected: some text is extracted from the scanned pages.
> 	String text = handler.toString().trim();
> 	assertFalse(text.isEmpty());
> }
> {code}
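
One thing that may be worth ruling out on the Windows machines listed in the report: as far as I understand, the OCR support added under TIKA-93 relies on an external Tesseract installation, so OCR cannot run if the executable is not reachable. The sketch below is not part of the original report and uses only the JDK. The class name `TesseractPathCheck` and the helper `findExecutable` are hypothetical, and the binary names `tesseract` and `tesseract.exe` are assumptions.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

final class TesseractPathCheck {

    // Hypothetical helper: returns the first candidate file name found in any
    // directory of a PATH-style string, or an empty Optional if none exists.
    static Optional<Path> findExecutable(String pathValue, String... candidateNames) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : candidateNames) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException e) {
                    // Skip PATH entries that are not valid paths on this platform.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed binary names; adjust if the installation uses a different one.
        Optional<Path> found =
                findExecutable(System.getenv("PATH"), "tesseract", "tesseract.exe");
        System.out.println(found.map(p -> "Found: " + p)
                .orElse("tesseract not found on PATH"));
    }
}
```

If the check reports that nothing was found, the failing test above may simply never reach an OCR engine. This only covers the PATH lookup and says nothing about how the parser itself is configured.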



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
