tika-dev mailing list archives

From "John Mastarone (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-723) Rotated text isn't extracted correctly from PDFs
Date Fri, 25 Nov 2011 02:58:40 GMT

    [ https://issues.apache.org/jira/browse/TIKA-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13156968#comment-13156968 ]

John Mastarone commented on TIKA-723:

With the latest source, I tried adding the line
"if (parser instanceof org.apache.tika.parser.pdf.PDFParser){ ((org.apache.tika.parser.pdf.PDFParser)parser).setSortByPosition(true);}"
to the parse method of the CompositeParser class, right after the line "Parser parser
= getParser(metadata);". I also had to add tika-parsers as a dependency of tika-core. After
rebuilding the core jar and tika-app, the rotated text was no longer laid out vertically,
one character per line, when using the GUI. None of the other PDFs in the test-resources
folder appeared to be parsed incorrectly, except for the first one (testAnnotations.pdf),
which fails to parse entirely. It also fails to parse with an unmodified, most recent version
of the Tika GUI, with the same NPE in both cases, so it does not appear to be caused by my
change. I don't know whether there's a JIRA item for this yet. Also, I downloaded the PDFBox
application jar and ran ExtractText with the -sort option, and this correctly extracted the
rotated text in your rotated.pdf file.
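
For readability, here is the shape of that one-line change as a self-contained sketch. The Parser and PdfParser types below are simplified, hypothetical stand-ins and not the real Tika classes; only the instanceof check followed by setSortByPosition(true) mirrors the line quoted above.

```java
public class SortByPositionPatchSketch {
    /** Hypothetical, simplified stand-in for org.apache.tika.parser.Parser. */
    interface Parser {}

    /** Hypothetical stand-in for the PDF parser; it only models the sort flag. */
    static class PdfParser implements Parser {
        private boolean sortByPosition = false;

        void setSortByPosition(boolean value) {
            this.sortByPosition = value;
        }

        boolean getSortByPosition() {
            return sortByPosition;
        }
    }

    /** Any non-PDF parser; it must be left untouched by the patch. */
    static class OtherParser implements Parser {}

    /** Mirrors the patched step: turn on position sorting only for PDF parsers. */
    static Parser configure(Parser parser) {
        if (parser instanceof PdfParser) {
            ((PdfParser) parser).setSortByPosition(true);
        }
        return parser;
    }

    public static void main(String[] args) {
        PdfParser pdf = new PdfParser();
        configure(pdf);
        configure(new OtherParser()); // no effect, no exception
        System.out.println("sortByPosition = " + pdf.getSortByPosition());
    }
}
```

In the actual experiment the check sits inside CompositeParser.parse, which is why tika-core then needs the PDF parser on its classpath.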

After making this change to CompositeParser, two test cases failed in tika-parsers, at
lines 147 and 180 of PDFParserTest.java, which concern testPDFTwoTextBoxes.pdf and a table
in testPDFVarious.pdf.  However, the assertions made in these lines are arguably up for interpretation:
should the Tika PDF parser really print all of the items in a column before moving on to the
next column?  The change I made results in all elements of a given row being printed before
moving on to the next row (row-major order instead of column-major).  This could be fine for
the table in testPDFVarious.pdf, but maybe less so for the two text boxes in the other PDF?
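
To make the ordering difference concrete, the sketch below sorts positioned text fragments top to bottom, then left to right. It is only an illustration under assumptions: the Fragment class and the coordinates are made up, the input is assumed to arrive column by column, and the real position sort in PDFBox is more involved than a plain (y, x) comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ReadingOrderSketch {
    /** Hypothetical text fragment with a top-left position; y grows downward. */
    static final class Fragment {
        final String text;
        final int x;
        final int y;

        Fragment(String text, int x, int y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns the fragment texts in the order they were given (stream order). */
    static List<String> inStreamOrder(List<Fragment> fragments) {
        List<String> out = new ArrayList<>();
        for (Fragment f : fragments) {
            out.add(f.text);
        }
        return out;
    }

    /** Returns the fragment texts sorted by y, then x: a row-major reading order. */
    static List<String> sortedByPosition(List<Fragment> fragments) {
        List<Fragment> copy = new ArrayList<>(fragments);
        copy.sort(Comparator.comparingInt((Fragment f) -> f.y)
                .thenComparingInt(f -> f.x));
        return inStreamOrder(copy);
    }

    public static void main(String[] args) {
        // A 2x2 table assumed to be stored column by column.
        List<Fragment> table = Arrays.asList(
                new Fragment("A1", 0, 0), new Fragment("A2", 0, 10),
                new Fragment("B1", 50, 0), new Fragment("B2", 50, 10));
        System.out.println(inStreamOrder(table));    // [A1, A2, B1, B2]
        System.out.println(sortedByPosition(table)); // [A1, B1, A2, B2]
    }
}
```

For a table the sorted order reads naturally, but for two side-by-side text boxes the same sort interleaves their lines, which is the concern with testPDFTwoTextBoxes.pdf.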

So, I'm not experienced with Tika development at all, but perhaps a line (or lines) like the
one above should be somewhere in the code--if not in the CompositeParser, then elsewhere,
depending on what you and/or others think about the test cases that would fail as a result.

> Rotated text isn't extracted correctly from PDFs
> ------------------------------------------------
>                 Key: TIKA-723
>                 URL: https://issues.apache.org/jira/browse/TIKA-723
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: rotated.pdf
> I have an example PDF with 90-degree rotation; Tika produces the
> characters one per line.  I.e., the doc has "Some rotated text,
> here!" but Tika produces this:
> {noformat}
> <body><div class="page"><p>So
> m
> e
> r
> o
> t
> a
> t
> e
> d
> t
> e
> x
> t
> ,
> h
> e
> r
> e
> !</p>
> {noformat}
> I'm able to copy/paste the text out correctly.


