tika-dev mailing list archives

From "James Baker (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-1671) Wrapped lines in PDF files not processed correctly
Date Wed, 01 Jul 2015 13:56:05 GMT
James Baker created TIKA-1671:
---------------------------------

             Summary: Wrapped lines in PDF files not processed correctly
                 Key: TIKA-1671
                 URL: https://issues.apache.org/jira/browse/TIKA-1671
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.9
            Reporter: James Baker


Text that wraps over multiple lines in PDF documents is not extracted correctly by Tika. The
expected behaviour would be for it to be extracted as a single line, but instead a line break
is inserted at each wrap point.

This makes it hard, if not impossible, to reassemble text into its intended form, because
there is no way to tell whether a line break in the extracted text appeared in the document
or was inserted by Tika.
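
The ambiguity can be illustrated with a small, self-contained sketch. It does not use Tika or PDFBox. The class WrapAmbiguity and the method extractVisualLines are hypothetical stand-ins for an extractor that emits one output line per visual line, and the width of 11 characters is an arbitrary assumption chosen for the sample text. Two different source texts, one wrapped paragraph and one with a real line break, come out identical:

```java
import java.util.ArrayList;
import java.util.List;

class WrapAmbiguity {

    // Hypothetical stand-in for an extractor that emits one output line per
    // visual line: hard breaks are kept, and an extra break is added wherever
    // the text wraps at the given width. This is NOT Tika's actual code.
    static String extractVisualLines(String source, int width) {
        List<String> out = new ArrayList<>();
        for (String hardLine : source.split("\n", -1)) {
            StringBuilder current = new StringBuilder();
            for (String word : hardLine.split(" ")) {
                // Start a new visual line if the next word would not fit.
                if (current.length() > 0
                        && current.length() + 1 + word.length() > width) {
                    out.add(current.toString());
                    current.setLength(0);
                }
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(word);
            }
            out.add(current.toString());
        }
        return String.join("\n", out);
    }

    public static void main(String[] args) {
        // One logical line that wraps at the assumed width.
        String wrapped = "alpha beta gamma delta";
        // Two logical lines separated by a real line break.
        String hardBroken = "alpha beta\ngamma delta";

        String a = extractVisualLines(wrapped, 11);
        String b = extractVisualLines(hardBroken, 11);

        // Both sources yield the same text, so the original structure
        // cannot be recovered from the output alone.
        System.out.println("identical output: " + a.equals(b));
    }
}
```

Because both inputs map to the same output, a consumer of the extracted text cannot recover which breaks were in the document unless the extractor marks them differently.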



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
