tika-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-911) Converted PDF document contains question marks in place of spaces and inconsistent case
Date Wed, 02 May 2012 18:42:49 GMT

    [ https://issues.apache.org/jira/browse/TIKA-911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13266801#comment-13266801
] 

Michael McCandless commented on TIKA-911:
-----------------------------------------

So strange ... I tested on a Mac (10.6.8) with Java 1.6.0_31, and I don't see the ? for spaces
nor the mixed case.

Hmm, my header has a different content-length than yours:

{noformat}
<meta name="xmpTPg:NPages" content="2"/>
<meta name="Creation-Date" content="2012-05-02T10:25:00Z"/>
<meta name="created" content="Wed May 02 06:25:00 EDT 2012"/>
<meta name="Content-Length" content="639985"/>
<meta name="Last-Modified" content="2012-05-02T10:25:00Z"/>
<meta name="producer" content="Mac OS X 10.6.8 Quartz PDFContext"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="resourceName" content="Rust Biosecurity Brochure.pdf"/>
<meta name="creator" content="Adobe InDesign CS2 (4.0.5)"/>
{noformat}

OK! If I use the PDF attached to the issue, I do see these problems (the copy I tested
first was downloaded from the web site).  Maybe the web site has since changed/fixed the PDF?  Hmm.

So, the extra characters (where there should be spaces) are U+FFFD (the Unicode replacement
character); Tika outputs this whenever there is a character it can't safely output into the
XHTML (this is done in SafeContentHandler.java).  Tika used to (before 0.10) simply replace
such characters with space (ASCII 32), so to get back to the pre-0.10 behaviour you can replace
U+FFFD with space in the extracted text.
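
As a rough sketch of that workaround (not anything Tika ships): assuming you already have the
extracted text as a String, something like the following would map U+FFFD back to a space. The
class and method names here are made up for illustration. Note that it replaces every U+FFFD,
including any that stood for a genuinely unmappable character rather than a space.

```java
// Hypothetical helper, not part of Tika: post-processes text that has
// already been extracted, restoring the pre-0.10 "replace with space" result.
public final class ReplacementCharFix {
    /** U+FFFD, the Unicode replacement character. */
    private static final char REPLACEMENT_CHAR = '\uFFFD';

    private ReplacementCharFix() {}

    /** Returns the input with every U+FFFD replaced by an ASCII space (32). */
    public static String restoreSpaces(String extracted) {
        return extracted.replace(REPLACEMENT_CHAR, ' ');
    }

    public static void main(String[] args) {
        // Sample input modelled on the output quoted in the issue.
        String sample = "growing\uFFFDcrops\uFFFDgreatly\uFFFDincreases";
        System.out.println(restoreSpaces(sample));
    }
}
```

If you consume the XHTML output instead of plain text, the same replacement could be applied to
the character data before it is written out.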

Not sure about the mixed case issue...

                
> Converted PDF document contains question marks in place of spaces and inconsistent case
> ---------------------------------------------------------------------------------------
>
>                 Key: TIKA-911
>                 URL: https://issues.apache.org/jira/browse/TIKA-911
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.1
>            Reporter: Matt Sheppard
>         Attachments: Rust Biosecurity Brochure.pdf, Rust Biosecurity Brochure.pdf.html
>
>
> The PDF document at http://www.grdc.com.au/uploads/documents/Rust%20Biosecurity%20Brochure.pdf,
> when converted with tika v1.1 using
> {code}
> $ java -jar tika-app-1.1.jar Rust\ Biosecurity\ Brochure.pdf
> {code}
> Produces substantially worse output than xpdf's pdftotext program.
> Specifically, we see...
> Some 'spaces' replaced with question marks
> {noformat}
> ...
> <body><div class="page"><p/>
> <p>How can I help?
> When you're overseas:
> • ?wherever?possible,?don't?visit?crops?—?contact?with?
> </p>
> <p>growing?crops?greatly?increases?the?risk?of?contaminating?
> footwear?or?clothing;?
> ...
> {noformat}
> and some odd case conversions
> {noformat}
> <p>stem rust in wheat.  
>  (soURce: BRAd collIs)</p>
> <p/>
> </div>
> {noformat}
> (The original document seems to contain "SOURCE: BRAD COLLIS" all in upper case.)
> To compare that with pdftotext
> {code}
> $ ./xpdfbin-linux-3.03/bin32/pdftotext -enc UTF-8 -q ~/Rust\ Biosecurity\ Brochure.pdf
> {code}
> This does not output the question marks, and produces "Source: BRAD COLLIS" at the end
> there, both of which seem to be improvements. Note that it does, however, produce a number
> of ^G characters which are not desirable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
