tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (TIKA-2256) Japanese character substituted when reading PDF
Date Thu, 22 Jun 2017 19:37:00 GMT

     [ https://issues.apache.org/jira/browse/TIKA-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2256.
-------------------------------
    Resolution: Not A Problem

Thanks to [~tilman], I think we've figured this out.  This is likely a bug in the PDF
generation code of MSWord for Mac.  [~ccreutzig], when you file a bug report with MS, you
might mention that these two characters (U+2F47 and U+65E5) can be conflated via Unicode
normalization rules: compatibility normalization (NFKC/NFKD) maps U+2F47 to U+65E5.  It looks
like Mac's MSWord is going backwards, though...
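
For anyone who wants to check the normalization relationship mentioned above, here is a minimal sketch that uses only java.text.Normalizer from the JDK. The class and method names are illustrative and not part of Tika, and this is a possible workaround on the consumer side, not something Tika does for you.

```java
import java.text.Normalizer;

public class KangxiNormalizeDemo {
    // Apply compatibility composition (NFKC), which folds compatibility
    // characters such as Kangxi radicals into their unified ideographs.
    static String normalizeCompat(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String extracted = "\u2F47\u672C\u8A9E"; // U+2F47 followed by U+672C U+8A9E
        String expected = "\u65E5\u672C\u8A9E";  // U+65E5 U+672C U+8A9E
        // The raw strings differ because the first code point is not the same.
        System.out.println(extracted.equals(expected));
        // After NFKC the two strings compare equal.
        System.out.println(normalizeCompat(extracted).equals(expected));
    }
}
```

Note that canonical normalization (NFC/NFD) leaves U+2F47 alone, so only the compatibility forms help here, and they also fold other compatibility characters that a caller may want to keep.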

> Japanese character substituted when reading PDF
> -----------------------------------------------
>
>                 Key: TIKA-2256
>                 URL: https://issues.apache.org/jira/browse/TIKA-2256
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.14
>            Reporter: Christopher Creutzig
>         Attachments: mixed-fonts.pdf
>
>
> The attached file contains “日本語” in its first line. It was created on Mac OS X 10.11.6
> by selecting “Save As PDF” in the system print dialog started from Microsoft Word.
> Reading the text from the PDF, the first character is not read as U+65E5, but as U+2F47.
> Copy & paste from Preview.App results in the correct U+65E5 being copied. (The characters
> look the same in some fonts, but are different.)
> The MATLAB code used for reading looks as follows:
>   % fullname holds the path to the PDF file
>   handler = org.apache.tika.sax.ToXMLContentHandler;
>   parser = org.apache.tika.parser.AutoDetectParser;
>   metadata = org.apache.tika.metadata.Metadata;
>   fh = java.io.FileInputStream(fullname);
>   parser.parse(fh, handler, metadata);
>   fh.close;  % release the file handle once parsing is done
>   s = handler.toString;
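
Because the two characters look the same in many fonts, it can help to inspect the extracted text code point by code point. The following is a rough sketch in plain Java with no Tika dependency. CodePointReport and its methods are hypothetical helper names, and the input string stands in for whatever text the handler returned.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointReport {
    // One entry per code point: the code point in U+XXXX form followed by
    // the Unicode block that the JDK reports for it.
    static List<String> describe(String text) {
        List<String> lines = new ArrayList<>();
        text.codePoints().forEach(cp ->
            lines.add(String.format("U+%04X %s", cp, Character.UnicodeBlock.of(cp))));
        return lines;
    }

    // True if any code point falls in the Kangxi Radicals block, which is
    // where U+2F47 lives; U+65E5 is in CJK Unified Ideographs instead.
    static boolean containsKangxiRadical(String text) {
        return text.codePoints().anyMatch(cp ->
            Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.KANGXI_RADICALS);
    }

    public static void main(String[] args) {
        String extracted = "\u2F47\u672C\u8A9E"; // stand-in for the text read from the PDF
        describe(extracted).forEach(System.out::println);
        System.out.println("Kangxi radical present: " + containsKangxiRadical(extracted));
    }
}
```

A check like this only flags the symptom; it does not say whether the radical came from the PDF itself or from the extraction step.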



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
