tika-dev mailing list archives

From "Raymond Wu (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-1679) Parse PDF file page by page
Date Wed, 15 Jul 2015 09:10:05 GMT
Raymond Wu created TIKA-1679:
--------------------------------

             Summary: Parse PDF file page by page
                 Key: TIKA-1679
                 URL: https://issues.apache.org/jira/browse/TIKA-1679
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.9
            Reporter: Raymond Wu


I have a PDF file that contains 5 pages.
Page 3 cannot be parsed by PDFBox, but the remaining pages are okay.
So I tried to parse this file page by page.
I modified the method PDF2XHTML.process() in PDF2XHTML.java as follows:

public static void process(
            PDDocument document, ContentHandler handler, Metadata metadata,
            boolean extractAnnotationText, boolean enableAutoSpace,
            boolean suppressDuplicateOverlappingText, boolean sortByPosition)
            throws SAXException, TikaException {
        try {
            // Extract text using a dummy Writer as we override the
            // key methods to output to the given content
            // handler.

            Writer dummyWriter = new Writer() {
                @Override
                public void write(char[] cbuf, int off, int len) {
                }
                @Override
                public void flush() {
                }
                @Override
                public void close() {
                }
            };

            // Parse page by page so that one broken page does not abort the rest
            int nop = document.getNumberOfPages();
            for (int i = 1; i <= nop; i++) {
                PDF2XHTML pdf2XHTML = new PDF2XHTML(handler, metadata,
                        extractAnnotationText, enableAutoSpace,
                        suppressDuplicateOverlappingText, sortByPosition);
                try {
                    pdf2XHTML.setStartPage(i);
                    pdf2XHTML.setEndPage(i);
                    pdf2XHTML.writeText(document, dummyWriter);
                } catch (Exception e) {
                    // TODO: log or report the failure; for now page i is skipped
                }
            }
        } catch (IOException e) {
            if (e.getCause() instanceof SAXException) {
                throw (SAXException) e.getCause();
            } else {
                throw new TikaException("Unable to extract PDF content", e);
            }
        }
    }

This method can parse PDFs in which some pages are broken.
I know it's not an optimized design,
but it is enough to solve my problem.
From Tika 1.4 through 1.9, I have had to recompile every version to work around this problem.
So I'd like to improve this parser.
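
As a rough illustration of how the "TODO" in the catch block might be filled in, here is a minimal, self-contained sketch of the same per-page isolation idea that also records which pages failed. It does not use Tika or PDFBox: PageExtractor, Result, and extractAll are hypothetical names made up for this example, and the lambda in main only simulates a broken page 3.

```java
import java.util.ArrayList;
import java.util.List;

public class PageByPageSketch {

    /** Hypothetical stand-in for one per-page extraction call (not a Tika or PDFBox type). */
    interface PageExtractor {
        String extract(int pageNumber) throws Exception;
    }

    /** Text of the pages that parsed, plus the numbers of the pages that failed. */
    static final class Result {
        final List<String> pageTexts = new ArrayList<>();
        final List<Integer> failedPages = new ArrayList<>();
    }

    /** Processes pages 1..numberOfPages independently; a failure on one page does not stop the loop. */
    static Result extractAll(int numberOfPages, PageExtractor extractor) {
        Result result = new Result();
        for (int page = 1; page <= numberOfPages; page++) {
            try {
                result.pageTexts.add(extractor.extract(page));
            } catch (Exception e) {
                // Remember the broken page instead of silently dropping it
                result.failedPages.add(page);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Simulated 5-page document in which page 3 cannot be parsed
        Result r = extractAll(5, page -> {
            if (page == 3) {
                throw new IllegalStateException("simulated broken page");
            }
            return "text of page " + page;
        });
        System.out.println("Parsed pages: " + r.pageTexts.size());
        System.out.println("Failed pages: " + r.failedPages);
    }
}
```

In a real patch the failed page numbers could perhaps be reported through the Metadata object or a log message, but that choice is an assumption and not part of the change above.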



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
