tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1671) Wrapped lines in PDF files not processed correctly
Date Wed, 01 Jul 2015 15:33:04 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610474#comment-14610474 ]

Tim Allison commented on TIKA-1671:

Thank you for raising this.  TIKA-1641 looks like the same type of issue, I think.  If
you can try PDFBox-app's ExtractText on its own (without Tika) and see whether you get the
same result, that'd be great.  If you do, then unfortunately it is beyond the scope of Tika
to recombine lines.  If ExtractText gives you what you want, then there may be something in
Tika that we can fix.
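
If recombining does turn out to be out of scope, it would have to happen on the
caller's side after extraction.  Below is a minimal sketch of one such heuristic,
not anything Tika or PDFBox provides: the class name LineUnwrapper, the method
unwrap and the minWrapLength parameter are all hypothetical.  It assumes a line
break is a soft wrap when the line before it is at least minWrapLength characters
long and neither neighbouring line is blank.  It will misjudge short wrapped lines
and long unwrapped ones, which is exactly the ambiguity described in the issue.

```java
public class LineUnwrapper {

    /**
     * Re-joins lines that look like soft wraps (heuristic, hypothetical helper).
     * A break is replaced by a space when the current line has at least
     * minWrapLength characters and both it and the next line are non-blank.
     * All other breaks, including blank-line paragraph breaks, are kept.
     */
    public static String unwrap(String text, int minWrapLength) {
        // Keep trailing empty strings so blank lines at the end survive.
        String[] lines = text.split("\\r?\\n", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i].trim();
            out.append(line);
            if (i == lines.length - 1) {
                break; // no break character after the last line
            }
            String next = lines[i + 1].trim();
            boolean looksWrapped = !line.isEmpty()
                    && !next.isEmpty()
                    && line.length() >= minWrapLength;
            out.append(looksWrapped ? ' ' : '\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String extracted = "The quick brown fox jumps over\nthe lazy dog.\n\nShort line\nNext";
        // With a threshold of 20, only the first break is treated as a wrap.
        System.out.println(unwrap(extracted, 20));
    }
}
```

A sensible value for minWrapLength depends on the document; something a little
below the longest extracted line is a reasonable starting guess.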

> Wrapped lines in PDF files not processed correctly
> --------------------------------------------------
>                 Key: TIKA-1671
>                 URL: https://issues.apache.org/jira/browse/TIKA-1671
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: James Baker
>              Labels: pdf, wrapping
>         Attachments: Test Document.pdf
> Text that wraps over multiple lines in PDF documents is not extracted correctly by Tika.
> The expected behaviour would be for it to be extracted as a single line, but instead a line
> break is inserted at each wrap point.
> This makes it hard, if not impossible, to reassemble text into its intended form, as
> it is not known whether a line break in the extracted text is one that appeared in the document
> or one that was inserted by Tika.
