tika-dev mailing list archives

From "Tyler Palsulich (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-911) Converted PDF document contains question marks in place of spaces and inconsistent case
Date Mon, 02 Mar 2015 04:27:04 GMT

    [ https://issues.apache.org/jira/browse/TIKA-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342716#comment-14342716 ]

Tyler Palsulich commented on TIKA-911:
--------------------------------------

Still seeing this issue (question marks instead of spaces) on a Mac with Tika 1.8-SNAPSHOT.

{{mvn -version}}:
{code}
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4; 2014-08-11T16:58:10-04:00)
Maven home: /usr/local/Cellar/maven/3.2.3/libexec
Java version: 1.7.0_71, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_71.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.10.2", arch: "x86_64", family: "mac"
{code}
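
One possible explanation, not confirmed anywhere in this thread, is that the PDF uses a non-ASCII space such as U+00A0 and that the character is replaced with '?' when the output is written in a charset that cannot represent it. The minimal sketch below shows that substitution using only the JDK. The class name {{QuestionMarkCheck}}, the helper {{roundTrip}}, and the sample string are hypothetical, and the presence of U+00A0 in the brochure is an assumption. Since the environment above reports UTF-8 as the platform encoding, this may well not be the cause here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: shows how an unmappable
// character turns into '?' when text is encoded in a narrow charset.
public class QuestionMarkCheck {

    // Encode the text in the given charset and decode it again,
    // roughly what happens when output passes through a writer.
    // String.getBytes(Charset) substitutes the charset's replacement
    // bytes for characters it cannot map.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Assumption: the words are separated by no-break spaces (U+00A0).
        String sample = "wherever\u00A0possible,\u00A0don't\u00A0visit\u00A0crops";

        // US-ASCII cannot represent U+00A0, so each one becomes '?'.
        System.out.println(roundTrip(sample, StandardCharsets.US_ASCII));

        // UTF-8 can represent it, so the text survives unchanged.
        System.out.println(roundTrip(sample, StandardCharsets.UTF_8).equals(sample));
    }
}
```

If the extracted text survives a UTF-8 round trip but shows question marks after a US-ASCII one, the output encoding is worth checking before looking at the PDF parser itself.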

> Converted PDF document contains question marks in place of spaces and inconsistent case
> ---------------------------------------------------------------------------------------
>
>                 Key: TIKA-911
>                 URL: https://issues.apache.org/jira/browse/TIKA-911
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.8
>            Reporter: Matt Sheppard
>         Attachments: Rust Biosecurity Brochure.pdf, Rust Biosecurity Brochure.pdf.html
>
>
> The PDF document at http://www.grdc.com.au/uploads/documents/Rust%20Biosecurity%20Brochure.pdf, when converted with Tika v1.1 using
> {code}
> $ java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf
> {code}
> produces substantially worse output than xpdf's pdftotext program.
> Specifically, we see some 'spaces' replaced with question marks:
> {noformat}
> ...
> <body><div class="page"><p/>
> <p>How can I help?
> When you're overseas:
> • ?wherever?possible,?don't?visit?crops?—?contact?with?
> </p>
> <p>growing?crops?greatly?increases?the?risk?of?contaminating?
> footwear?or?clothing;?
> ...
> {noformat}
> and some odd case conversions:
> {noformat}
> <p>stem rust in wheat.  
>  (soURce: BRAd collIs)</p>
> <p/>
> </div>
> {noformat}
> (The original document seems to contain "SOURCE: BRAD COLLIS" all in upper case.)
> For comparison, running pdftotext with
> {code}
> $ ./xpdfbin-linux-3.03/bin32/pdftotext -enc UTF-8 -q ~/Rust\ Biosecurity\ Brochure.pdf
> {code}
> does not output the question marks and produces "Source: BRAD COLLIS" at the end there, both of which seem to be improvements. Note that it does, however, produce a number of ^G characters, which are not desirable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
