tika-dev mailing list archives

From "Angela Onslow (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-2054) Problem with ligatures converting from PDF to HTML with Tika
Date Fri, 12 Aug 2016 11:48:20 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418723#comment-15418723 ]

Angela Onslow edited comment on TIKA-2054 at 8/12/16 11:48 AM:
---------------------------------------------------------------

Here is a file which demonstrates this problem (see attachments)



was (Author: angela@erevalue.com):
Here is a file which demonstrates this problem

> Problem with ligatures converting from PDF to HTML with Tika
> ------------------------------------------------------------
>
>                 Key: TIKA-2054
>                 URL: https://issues.apache.org/jira/browse/TIKA-2054
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.11, 1.13
>            Reporter: Angela Onslow
>         Attachments: 2482_2014_DAVIDE+CAMPARI-MILANO+SPA_SUSTY-AR.pdf
>
>
> When converting certain PDFs to HTML I am having trouble with ligature characters being displayed as U+FFFD � REPLACEMENT CHARACTER.
> I have tried Apache Tika 1.11 and 1.13, converting on the command line using the .jar, and get the same results.
> If I use pdfbox-app-2.0.1.jar and 'ExtractText' with the icu4j-57_1.jar in the path and convert to text rather than HTML, then I am able to at least preserve information about what each ligature was originally, even if they are still represented as unprintable characters.

> I.e. if I run the following from the command line:
> java -jar pdfbox-app-1.8.12.jar ExtractText 'test.pdf' 'test.txt'
> then the resulting test.txt, when viewed in Sublime2, has "fi" represented as US (the unit separator character), "ff" represented as RS, "fl" represented as GS and "ffl" represented as FS, which I could then replace with the appropriate characters.
> I was under the impression that Tika uses icu4j. Is there a way to get the same behaviour with Tika that I see with PDFBox when converting from PDF to HTML?
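
The replacement step mentioned in the description (turning the US, RS, GS and FS control characters back into letters) could look roughly like the sketch below. It is only an illustration: the class name LigatureRestorer and the method restore are hypothetical, and the mapping is taken solely from what was observed for this one attachment, so other PDFs or fonts may use different codes. It is not part of Tika or PDFBox.

```java
import java.util.Map;

// Hypothetical helper, not part of Tika or PDFBox.
final class LigatureRestorer {

    // Assumed mapping, based only on the output reported for this PDF:
    // US = "fi", RS = "ff", GS = "fl", FS = "ffl". Other files may differ.
    private static final Map<Character, String> LIGATURES = Map.of(
            '\u001F', "fi",    // US, unit separator
            '\u001E', "ff",    // RS, record separator
            '\u001D', "fl",    // GS, group separator
            '\u001C', "ffl");  // FS, file separator

    // Returns a copy of the text with each mapped control character
    // replaced by the letters of the ligature it stood for.
    static String restore(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String letters = LIGATURES.get(c);
            if (letters != null) {
                out.append(letters);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample input built by hand to mimic the extracted text.
        System.out.println(restore("o\u001Eice \u001Fnd \u001Dow wa\u001Ce"));
    }
}
```

This only post-processes text that has already been extracted; it does not change how Tika or PDFBox decode the font, and it would corrupt any document that uses these control characters for another purpose.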



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
