tika-dev mailing list archives

From "Paul Pearcy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (TIKA-548) PDF content extracted as single line
Date Wed, 29 Dec 2010 18:45:48 GMT

    [ https://issues.apache.org/jira/browse/TIKA-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12975864#action_12975864 ]

Paul Pearcy commented on TIKA-548:
----------------------------------

+1 for a 0.8.1 release, unless 0.9 is imminent.

Thanks!

> PDF content extracted as single line
> ------------------------------------
>
>                 Key: TIKA-548
>                 URL: https://issues.apache.org/jira/browse/TIKA-548
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.8
>            Reporter: Staffan Olsson
>            Assignee: Jukka Zitting
>             Fix For: 0.9
>
>         Attachments: test.pdf, tika-PDF-content-regression-test.patch
>
>
> Rev 1029510 introduces a regression in PDF content parsing, now present in 0.8 RC. Paragraphs
> from the PDF are no longer separated by newlines. This is a problem both for reading and for
> indexing. See the attached test.
> Note that PDFBox 1.3.1 seems to extract correctly, at least from the command line.
> Here is the output for a sample file with a headline followed by a one-word paragraph:
> $> java -jar pdfbox-app-1.3.1.jar ExtractText -console docs/shortpdf.pdf
> 1   -    untitled 3   -    2010-02-13 09:52   -    Staffan Olsson
> PDF Title For Short Document
> veryshortpdfcontents
> But Tika prints:
> $> java -jar tika-app-0.9-20101110.175016-3.jar docs/shortpdf.pdf
> ...
> <p>1   -    untitled 3   -    2010-02-13 09:52   -    Staffan OlssonPDF
> Title For Short Documentveryshortpdfcontents</p>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

